#49 Data Science Tool Building (with Wes McKinney)

The Future of Data Science and Open Source Software Development

As data science and machine learning continue to mature, it is worth considering the role that open source software development plays in shaping their future. According to Wes McKinney, creator of pandas, one of the most widely used data analysis libraries in Python, the ability to leave data in place and choose your user interface is crucial for effective data analysis: users should be able to pick the programming language that best suits their needs, whether their priority is interactive exploration or software development.

One path toward that goal is defragmentation: multiple programming languages working together seamlessly on top of common libraries, algorithms, and data representations. The Apache Arrow project is already moving in this direction, defining a language-independent columnar format so that data can be shared between languages without conversion or serialization, with the aim of a more consistent user experience for data scientists. By sharing ideas, use cases, and feedback on open source projects, individuals can contribute to more efficient and scalable software while also fostering a sense of community among developers.
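
To make the idea concrete, here is a minimal sketch of what language-agnostic data sharing can look like from Python, assuming the pyarrow package is installed; the file name and sample values are purely illustrative.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

# Illustrative data; any pandas DataFrame works here.
df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "GOOG"],
    "price": [189.84, 417.32, 176.45],
})

# Convert the DataFrame to an Arrow table, the language-independent
# columnar representation McKinney describes.
table = pa.Table.from_pandas(df)

# Write it in the Arrow-based Feather format. The same file can be opened
# from R (arrow::read_feather) or any other Arrow implementation without
# a custom connector.
feather.write_feather(table, "trades.feather")

# Reading it back yields a pandas DataFrame again.
print(feather.read_feather("trades.feather"))
```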

The rise of open source software development has significant implications for the future of data science. McKinney predicts that programming languages will diminish in importance relative to the data itself and to common computational libraries. As datasets grow well beyond what fits comfortably in memory, we need tools that can process and analyze them efficiently. This is where portable data structures come into play: structures such as Arrow's columnar format are accessible from many programming languages and are designed to take full advantage of modern hardware, including when the data does not fit in memory on a single machine.
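
As one illustration of working beyond memory limits, the sketch below streams over a directory of Parquet files in record batches using pyarrow's dataset API; the path and column names are hypothetical.

```python
import pyarrow.dataset as ds

# Point at a (hypothetical) directory of Parquet files that may be far
# larger than available RAM.
dataset = ds.dataset("events/", format="parquet")

total_rows = 0
# to_batches() yields Arrow record batches one at a time, so only a small
# window of the data is materialized in memory at any moment.
for batch in dataset.to_batches(columns=["user_id", "value"]):
    total_rows += batch.num_rows

print(f"Scanned {total_rows} rows without loading everything at once")
```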

One of the key challenges facing open source software development is funding. As McKinney points out, if corporations effectively tithed even a fraction of one percent of their profits to the open source infrastructure they rely on, the current funding and sustainability crisis would essentially go away. Stable funding would improve the quality of open source projects and allow maintainers to work on them full time rather than in the margins of other jobs.

Funding is not the only way to help. As McKinney notes, writing code is not the sole valuable contribution: reviewing code, commenting on the roadmap, and taking part in discussions about design and future scope all matter. By engaging with these communities and providing feedback, individuals help build consensus and prioritize the work being done in open source projects, which makes for healthier and more productive communities.

Fairness in artificial intelligence is a pressing issue that demands attention from data scientists and developers. According to Cathy O'Neil, data scientist, investigative journalist, consultant, algorithmic auditor, and author of "Weapons of Math Destruction," algorithms can perpetuate societal biases if they are not designed with fairness in mind. Transparency and auditability of algorithms are essential for building more equitable systems.
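
Auditing can start very simply. The sketch below, a rough illustration rather than a full fairness methodology, compares a model's approval rates across groups and computes a disparate impact ratio; the column names and the four-fifths threshold are assumptions made for the example.

```python
import pandas as pd

# Illustrative decisions table; in practice this would come from a model's
# predictions joined with demographic attributes.
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   0],
})

# Positive-decision rate per group.
rates = decisions.groupby("group")["approved"].mean()
print(rates)

# Disparate impact ratio: lowest group rate divided by highest group rate.
ratio = rates.min() / rates.max()
print(f"Disparate impact ratio: {ratio:.2f}")
if ratio < 0.8:  # the common "four-fifths" rule of thumb
    print("Potential disparity; this model warrants a closer audit.")
```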

In conclusion, the future of data science and open source software development holds a great deal of promise. By working together toward a less fragmented, more consistent user experience, we can harness the power of modern hardware and shared computational libraries to drive innovation forward. Portable data structures, together with transparency and auditability, will be critical ingredients in shaping a fairer future for AI.

"WEBVTTKind: captionsLanguage: enWow in this episode of data framed at data count podcast i'm speaking with wes mckinney creator of the pandas project for data analysis tools in python and author of python for data analysis among many other things Wes and I will talk about data science tool building what it took to get pandas off the ground and how he approaches building human interfaces to data to make individuals more productive on top of this we'll talk about the future of data science tooling including the Apache ro project and how it can facilitate this future the importance of data frames that are portable between programming languages and building tools that facilitate data analysis work in the Big Data limit pandas initially arose from wares noticing that people were nowhere near as productive as they could be due to a lack of tooling and the projects he's working on today which we'll discuss arise from the same place and present a bold vision for the future Wes also makes clear that pandas is not a one human show and when people thank him for his work he reminds them to thank Jeffrey back Joris vandenBosch a-- philip cloud and calm alex Berger along with the other pandas core developers that have really been driving the project forward over the last five years I for one want to thank all of them finally and not to give too much away we'll also discuss the challenges of open source software development and how wears is approaching funding and resourcing OSS with his most recent venture versa labs find out all this and more including how much of pandas was developed in a small East Village apartment that West may or may not have cohabitated with mice well I can't give too much away this is just the opening monologue I'm Hugo Baron Anderson a data scientist the data camp and this is data frame welcome to data frame a weekly data count podcast exploring what data science looks like on the ground for working data scientists and what problems are consulting I'm your host Hugo Bound Anderson you can follow me on Twitter that's you go down and data camp at data camp you can find all our episodes and show notes at data camp comm slash community slash podcast listeners as always check out the show notes for more material on the conversation today I've also included a survey in the show notes and a link to a forum where you can make suggestions for future episodes I'd really appreciate it if you take the survey so I can make sure that we're producing episodes that you want to hear now back to our regularly scheduled programming in Python hi there Wiz and welcome to data framed thank you thanks for having me great pleasure to have you on the show and I'm really excited to have you here today to talk about open source and software development to talk about your work at Ursa labs Apache arrow a bunch of other things to do with tool building but first I'd like to find out a bit about you perhaps you could open by telling me what you're known for in the data community sure yeah I mean I'm best known for for being the the original author of the Python panda's project which I understand that a lot of people use so I started building that as a closed source library a little over 10 years ago and I've been working on a lot of different open source projects for for the Python data science world world and beyond I also wrote a book called Python for data analysis which is now in its second edition and I think that's become a pretty pretty ubiquitous resource for people that are getting into the field 
of data science and are wanting to learn how to use pandas and you know get their feet wet with working with data and congrats on the second edition that was in the past year or so that was released right yeah I was yeah the end of just about a year ago the end of 2017 and how did you get into data science tool building originally because I'm aware that your background wasn't in in CS per se right I was a mathematician so I studied at a pure math at MIT I did a little bit of a little bit of computer science I had had some exposure to the world of machine learning in that I was aware that it existed but you know MIT didn't have a statistics to program so data analysis and statistics wasn't very familiar to me when I was entering the working world and I got a job at an investment manager called AQR Capital Management which is based in Greenwich Connecticut and there were a number of MIT grads that had gone to work there and some of them were math majors and they kind of sold me on the idea of getting experience with applied math and then maybe I would go back to grad school later on and I found that in my job there that rather than doing very much applied math that I was really doing a lot of data munging so I was writing sequel I was using Excel and really I just found that um I wasn't as productive and efficient working with the data as I felt like I should have been and part of it was like well okay I'm just just starting out my career I'm 22 years old you know what do I know but I looked around and even even at people who were a lot more senior to me and a lot more experienced and it seemed like they weren't very productive either and they were spending a lot of time you know obviously their skill with Excel and cell shortcuts and so forth keyboard shortcuts was a lot better than mine but still it seemed like there was just something missing to kind of working with data and I started to learn I started to learn are at the end of 2007 beginning of 2008 and there were at that point in time you know the our ecosystem was a lot less mature and it felt like you know an interesting language for you know valuable language for doing statistics and data analysis but we also needed to kind of build software and so I learned a little bit of Python and thought like wow this is a really easy to use programming language I had done some Java programming and thought that I just wasn't very good at Java and so I thought man I'm just not cut out for building software but I decided to have a tinker with building some data manipulation tools in Python that was March April 2008 and you know just went down the the rabbit hole from there and what once I had made myself more productive working with data I started evangelizing the as I was building to my colleagues and you know I kept pulling on one thread and ended up becoming more of a software engineer in uh finance or you know a finance or math person yeah there are a lot of interesting touch points there for example your background in pure math and that you're in Connecticut I actually I was working in pure math and ended up doing applied maths in a biology lab in New Haven Connecticut not not in Greenwich but at that point I actually started dealing with data a lot as well and that's when I started getting into data science also it's also interesting that pandas when you first developed it was close source but before we get there you've spoken a bit too you know why why you chose Python could you explain a bit more about what was attractive about Python 
then because of course one of the a lot of the attractive things for researchers and data scientists now about python is the data science staff you know pandas scikit-learn an umpire or all of these things so what-what made you really like it back in the day yeah I mean at that at that point of time you know 2007-2008 in terms of doing statistical computing Python was not you know let's let's think of it as a promising world that has not yet been terraformed so I think that there were kind of the nuts and bolts of a really interesting environment you know I learned about the ipython project and said you know okay here's a here's a really nice interactive shell where you can plot things and you know it has tab completion and you know really basic interactive affordances that really help out a lot you had the nuts and bolts of the of doing all the analytical computing that you need to do for data manipulation numpy had it's a 1.0 release I think in 2006 and it had become a mature project and the you know scientific Python world was defragmenting itself after the number a numeric rift which had persisted for several years and you know Travis Travis olifant had sort of worked to bring those communities together but really I think what attracted me to the language was the accessibility and the fact that it was really very suited for interactive and exploratory computing that you could you didn't have to set up an elaborate development environment you know an IDE to be able to get up and running doing some really basic things and so having had experience with Java I think one of the things that put about Java was the elaborateness of the environment that you need to really be productive like you really need to set up an ID and if there's all this tooling that you need to do whereas with Python you could do you know some pretty complex things with a few lines of code in a text file and then you just run the script so that kind of like interactive scripting feel of doing exploratory computing was really compelling to me at the time but obviously it was Python was missing a lot of tools and so it was it was a bit daunting to start the process of building some of those tools from scratch yeah and you mentioned I Python numpy and Travis and I suppose you know this is the time where John Hunter was working a lot on on matplotlib and working with Fernando to incorporate it with with ipython there was a lot of close collaboration I suppose this speaks to the idea of community as well and did you find the scientific Python community something that would that was also attractive yeah well I you know I didn't have much interaction with the community until much later I think the first person there's two people that I met from the Python community who are like my first point of contact with that world so so one person is Erik Jones who is a founder of n thought which is the original like Python scientific computing company based in Austin Texas and I also run the sci-fi conference yes they run Syfy and and so n thought was doing a lot of I was doing a lot of consulting work in New York City with financial firms that were getting big into Python during that era like training and kind of custom development and I got in touch with Erik sometime during 2009 and and sort of gave him like kind of the very first external demo of pandas and this was right around the time that we were getting ready to publish the pandas bits on pipe I and so forth kind of the first open-source version of the project and then the the 
second person I met was John Hunter and himself from from matplotlib I met him in Chicago in January 2010 you know at that point I was looking around for like how to engage with Python world having you know just open sourced pandas and because John was working he worked for trade link and up until his death in 2012 he was a quant there having been a neuroscientist and and kind of had the building matplotlib for many years he he kind of took me under his wing as kind of I was his he was my mentor for you know for a couple of years and kind of helped me enter and get involved in the community and so I definitely feel that the eye fountain found it a very warm and kind of inviting community very collaborative and collegial and I think I was attracted to that you know that feeling you know it didn't seem like a lot of people competing with each other it was really just a lot of pragmatic software developers looking to kind of build tools that were useful and to help each other help each other succeed yeah and you actually still get the sense that when you go to site buy-in in Austin Texas every every July or every second year you you get still get a strong sense of community and people just loving building the tools together yeah yeah totally totally I mean obviously the community has grown much bigger and I think the the ratio of project developers people working on the open-source projects to the users that ratio has certainly changed a lot and that there are a lot more users now and there are developers you know I think the very first sci-fi conference was probably the majority of people there were people who are the developers of open-source projects but you know still I think it's a great community and I think that's that's helping kind of continue to bring people into the ecosystem yeah and actually I had Bryan Granger on the podcast recently and we discussed those you know several people discussing at the moment that we're now kind of entering a phase transition from having individual users spread across all sand spread across the globe of a lot of open source packages to actually having large-scale institutional adoption right yeah yeah for sure and I'm wondering in terms of pandas starting off as a project I'm under the impression it was started as a tool to be used in finance is that the case yeah I mean it was focused so if you look you can go back and download pandas zero point one which was published up IPI in December 2009 and see what was in the library and compared with now the functionality was a lot more geared towards time series data and the kinds of problems that we were dealing with back at AQR I wouldn't say that it necessarily is finance specific is very general data manipulation it was a pretty small project back then but it was just about dealing with tabular data dealing with messy data data munging Mis kind of data alignment essentially kind of all those like kind of really basic wrangling and data integration problems it wasn't really until 2011-2012 that the project got built like I built the project out and created a more comprehensive set relational algebra facilities like it didn't have complete joins like all the different kinds of basic joins until 2011 so its features that was certainly skewed by the use cases that we had in front of us back in AQR and how did you get the project off the ground I know that's a relatively ill-formed question but just in terms of hours and people and resources well you know you smelled metal or you kind of you forage to 
weapons you have to get the the crucible really really hot so we open sourced the project at the end of 2009 and I think we had deliberated kind of the whether or not to open source at all for about six months or so and ultimately powers that be decided that we would open source pandas and sort of see what would happen I gave my very first talk about pandas you can still find online at PyCon 2010 and in Atlanta and it was about using the subject of the talk was about using Python and quantitative finance but the the project didn't really go anywhere after that so it was on hosted on Google code this was you know could have existed but it was kind of a ruby thing at that time and I left a QR to go back to grad school I went to Duke to start a PhD in statistics statistical science it's called there and I continued to do a little bit of contract work developing pandas for a QR and somewhere I don't I think the catalyst for me was in early 2011 I started to get contacted by more companies that were exploring using Python for data analysis use cases and they had seen my talk at PyCon and were interested in getting my perspective on statistical computing and I just had this feeling that the ecosystem was facing a sort of existential crisis about whether or not it was going to become truly relevant for doing statistics it was clear to me that pandas was promising but really had not reached a level of functional completeness or usefulness to be the foundation of a statistical computing ecosystem in Python and so I guess I felt that feeling so strongly that I you know I sort of had like an epiphany where it wasn't quite like you know shouting Eureka and jumping out of the bathtub but I emailed my advisor and said hey I would like to take take a year off from my PhD and go explore this Python programming stuff and we'll see how it goes then I had some money saved from my first job and I I moved back to New York into a tiny apartment in the East Village which had mice and stuff really not the best place I've ever lived but I you know essentially was like I'm just gonna work full-time on pandas for a while and build it out and see see what happens and I think that's when as soon as started kind of socializing the functionality of pandas and filling in future gaps you know implementing joins and fixing some of the internal issues of course I created other internal problems but there were some there were definitely some design problems in the early versions of pandas that got fixed in the summer of 2011 but as soon as pandas could read CSV files pretty reliably and could do joins and a lot of the basic stuff that you need to be productive working with multiple data sets I think that's when it started to catch people's eye toward the end of 2011 and starts to take off the ground so around the same time I pitched the idea of a data analysis book in Python to O'Reilly and they agreed to do a book which looked thinking back on it was a bit risky because you know who knows what would have become of pandas was not at all a prompt you know not obviously going to be successful back in 2011 so they decided to take a bet and so much so that you know I asked them later why they didn't put a panda on the cover but they said well we're saving the Panda for something really big and so it wasn't even clear then that Python and pandas and everything was going to be a popular thing so it's important to kind of have that perspective we'll jump right back into our interview with Wes after a short segment now it's time for 
a segment called data science best practices I'm here with Ben's cranker an independent data science consultant hi Ben hey Hugo it's great to be back so what are we discussing this week well do you need to explain or to predict what do you mean Bremen has this must-read paper called statistical modeling the two cultures he discusses that there are two modeling approaches the algorithmic modeling culture and the data modeling culture by algorithmic modeling he means that the machine learning approach which has largely been developed by computer scientists he contrasts this with the data modeling culture which views data as stochastic and worries about modeling the data generating process economists and statisticians tend to fall into the latter camp Bremen argues that machine learning is dominant in terms of performance accuracy and ease of use is that all there is to it if only life were that easy for a large class of problems he is right and traditional data modelers are rushing to adopt these methods but the algorithmic approach fails for another large class of problems those where you need to explain the problem ie to understand a causal connection Kelly Kelly has done some great research in this regard she points out that you need to choose your method and approach based on whether you want quote to explain or to predict I see so use machine learning to predict and statistics or econometrics to explain that is pretty much the case if your problems focus on prediction then m/l is the place to start and is incredibly powerful both shallow and deep learning models are producing incredible results for predictive and perceptual problems but often we need to understand the drivers that affect a business problem in this case we must run an experiment or perform causal regression analysis to eliminate bias and our estimated effect sizes so how do you know which to use an experiment is the gold standard because of the magic of random assignment but you may not be able to run an experiment it could be too expensive take too long or be dangerous such as something a human subjects committee would not allow if you can't get experimental data then you must use observational data and perform a causal regression analysis which captures the key features of the data to eliminate bias so tell me more about bias we really need to talk about in dodging and depths at some point in dodging Andy is all the different ways bad stuff can be hidden in your error term which biases results such as sample selection omitted variable bias simultaneity and measurement error for example if you want to reduce churn you need to build an explanatory model determine how different levers affect churn in this case the data is censored because some customers remain customers past the end of our study we never observe them churning consequently we need to model the censoring process machine learning will have a difficult time learning how to compensate for censoring a classical survival model has censoring baked into the hypothesis space will almost surely work better finally I should add that including endogenous features in a machine learning model can cause all kinds of problems if you treat machine learning algorithms as black boxes you may regret it if your model needs to be retrained regularly you may have this problem and other other benefits to the data modeling approach another huge benefit of the data modelling approach is that you can do inference either frequent disturb ation that means you can formally state 
hypotheses about which levers matter and then test how likely it is that the data supports your hypothesis thanks Ben for explaining why prediction isn't a silver bullet and why we also need models to explain our data after that interlude it's time to jump back into our chat with Wes McKinney so when living in the East Village supporting yourself to build out the package did you have any inkling that it would achieve the growth and wide-scale adoption that it has no not really I mean I I believed that I mean obviously I had the belief that Python ecosystem had a lot of potential and that projects like pandas were necessary to help the language and the community realize the potential like I think there was a lot of computational firepower in the numpy world and all the tooling scythe on and tools for interoperability with native code and so so I just wanted to help realize that potential but I didn't really have a sense of where go there were some other significant kind of confluence of things that happened particularly when you consider the development of stats models and scikit-learn which brought meaningful analytical functionality to Python like I think if Candace you know really the big thing that made pandas successful was the fact that it could read CSV files reliably and so it became like a first port of entry for for data into Python and for kind of data cleaning and data preparation and so if you wanted to do machine learning and scikit-learn or you wanted to use stats models for statistics and econometrics you needed to clean data first and so using pandas was the obvious choice for that but it yeah it wasn't it wasn't obvious and you know i cruded a couple of my former colleagues from a QR m klein and chong show to work with me on pandas and we explored starting a company around financial analytics in python powered by pandas but we were focused on building out pandas as an open source project kind of while we explored kind of that startup idea ultimately we didn't pursue that startup but we it was clear that by mid-2012 that we'd sort of crossed the critical horizon of people being interested in python as a language for data analysis and since then you've found certain institutions which have employed you in order to work on pandas right I wouldn't say that outside of might at a QR when I was building pandas kind of initially I've never been employed directly to work on pandas I started the company called datapad with Cheung sha and so it was a venture back company and we were building a visual analytics product that was powered by pandas and other Python the data pad was acquired by cloud era at the end of 2014 and so chun and i landed there to work on and my role at Cloudera was to look holistically at the big data world and figure out how to forge a better path for python and data science tools in general in the context of the big data world that's the Hadoop ecosystem and spark and kind of all the technology that was largely Java based which had been developed since you know 2006 or 2008 and so but I wasn't working on pandas in particular at that point and I sort of had taken stock of the structural and kind of infrastructural problems that pandas had and I gave a talk at the end of 2013 at PI data in New York on the title of the talk was practical medium data analytics and Python and the subtitle of the talk was 10 things I hate about pandas I remember so I add this kind of in the background this feeling that pandas was built on a fantastic platform for scientific 
computing and numerical computing so if you are doing particle physics or HPC work in a national lab with a supercomputer you know Python is really great and that's how the ecosystem developed in the late 90s early 2000s but for statistical computing and big data and analytics fact that like strings and categorical data wasn't a first-class citizen in that world made things a lot a lot harder missing data was not a first-class citizen and so there were a lot of problems that had accumulated and so at that point I started to look beyond pandas as it was implemented then into kind of how we could build technology to advance the whole ecosystem and beyond the Python world as well so I think a through-line in in this is really encapsulated by a statement you made earlier which is you want to build technologies and tools that a truly relevant for doing statistics or working with data and I know as a tool builder you're committed to developing human interfaces to data make individuals more productive and I think that actually provides a really nice segue into a lot of what you're thinking about now in particular the apache aero project so i'm wondering if you can tell me about apache aero and how you feel it can facilitate data science work yeah so I got involved in in what became the apache aero project you know as part of my work at Cloudera so one problem that had plagued me as a Python programmer was the fact that when you arrived at foreign data and foreign systems that you want to plug into whether those are other kinds of you know ways of storing data or accessing data or accessing computational systems that we were in a position of having to build custom data connectors for Python for pandas or kind of whatever Python library you're using and so I felt that we were losing a lot of energy to building custom connectors into all of these different things and this problem isn't unique to Python so if you look at all of the number of like different pairwise adapters that are available to convert between one data format and another or serialized data from one programming language to another programming language so sharing data was something that had caused me a lot of pain and also sharing code and algorithms was a big problem so the way that pandas isn't implemented internally it has its own custom way of representing data that's layered on top of numpy arrays but we had to essentially re-implement all of our own algorithms and data access layers from scratch you know we'd implemented our own CSV Reader our own interfaces to hdf5 files our own interfaces to json data we have pretty large libraries of code and pandas for doing in memory analytics aggregating arrays performing group by operations and if you look across other parts of the big data world you see the same kinds of things implemented in many different ways and many different programming languages in our you have the same thing many of the same things implemented in our so I was kind of trying to make sense of all of that energy lost to sharing data and sharing code and thinking about how I could help enable the data world to become a lot less fragmented and people building systems people like me who build tools for people how to make people like me we're building tools a lot more productive and able to build that and more efficient data processing tools in the future and so this was just kind of feelings that I had and so I started to poke around cloudera and see if other people felt the same way and so I was working with folks on 
the Impala team people like Marcel corn Acker who started the Impala project todd lipkin who started the Apache kudu project it's now Apache and Paula joined the Apache foundation so you know there were a lot of people at Cloudera that essentially agreed with me and we sort of thought about like what kind of technology we could build but help improve interoperability and we sort of centered on the problem of representing data frames and tabular data and as we kind of looked outside of cloud era we saw that there were other groups of developers who concurrently were thinking about the exact same problem so we bumped into folks from the Apache drill project which is a sequel on Hadoop system and they were also thinking about the tabular data interoperability problem like how can we move around tabular data sets and reuse algorithms and code and data without so much conversion and in the energy loss and so very quickly you know we got 20-25 people in the room representing 12 or 13 open-source projects with a general consensus that we should build some technology to proverbially you know tie the room together that became Apache arrow but it took all of 2015 to put the project together now how is all this relevant to data science well what the arrow project provides is a way of representing data and memory that is language agnostic and standardized and portable so you can think of it as being like a language independent data frame so if you create arrow based data frames in Python you can share them with any system whether that's written in C or C++ or JavaScript or Java or rust or go as long as they implement the arrow columnar format they can interact with that data without having to convert it or serialize to some kind of intermediate representation like you usually have so the goal of the project in addition to providing high-quality libraries for building data science tools and building databases is also to improve the portability of code and data between languages outside of kind of the interoperability side of the project there's also the goal within the walls of a particular data processing system to provide a platform of algorithms and tools for memory management and data access that can accelerate large-scale data processing so we wanted the arrow columnar format to support working with much larger quantities of data the single node scale data that is particularly data that does not fit into memory I love this idea of you know tying to room together as you put it cuz essentially it speaks the idea of kind of breaking down the walls between all these silos that exist as well right yeah yeah I know I mean I think if you look across and just within the data science world I mean even though functionally we're solving many of the same problems like there's very little collaboration that happens between the communities whether collaborating at the software design level or at the code level and as a result people point fingers and accuse people of reinventing wheels or like not wanting to collaborate but you know really it's if your data is different in memory there's just no basis for co-chairing in most cases and so the desire to create an open standard for data frames is just if you want to share code it's essential you have to standardize the representation in RAM or on the GPU or essentially at the byte or the bit level agreeing on how what the data looks like once you load it off disk or at once you parse it out of a CSV file is the basis of collaboration amongst multiple programming 
languages or amongst different data science languages that are ultimately based in C or C++ yeah I remember actually Fernando Perez spoke to this as well in his keynote where you also keynote at the inaugural Jupiter con saying we welcome so many contributions but we need to agree on some things right and these are certain things that we've all agreed upon so if you're going to contribute let's build on these particular things right right yeah I know I think the Jupiter project certainly socialized this idea of open standards by developing the kernel protocol providing a way it's like you know here's the abstract notion of like a computation notebook and here's how if you want to build a kernel add a new language to the Jupiter ecosystem you know here's how you do it and you know that certainly has played out you know beautifully with you know I think it's like over 40 languages have kernel implementation for Jupiter but you know I think in general I think people are appreciating more the value of having open standards where that our community developed and that are developed on the basis of consensus and where there's just like kind of broad buy-in it's not it's like what one developer or one sort of isolated group of people building some technology and then trying to get people to adopt it so I think Jupiter is someone is unique in the sense that it started out in the Python world but I think it's there you know they set out with the goal of embracing a much broader community of users and developers and that's played out in really exciting ways I really like the descriptions you gave and kind of the inspiration behind the arrow project in particular you know the need for interoperability the importance of these portable data frames I don't wanna go too far down the rabbit hole I can't really help myself though I'd like you to speak just a bit more to kind of your thoughts behind the challenge of working in the Big Data limit I mean for example that we have computers and hard drives that can store a lot of stuff but we don't actually have languages that can interact with unless we parallelize it right great right so a common thing that I've heard over the years from people will say Wes like I just want to write pandas code but I want it to work with big data so it's a complicated thing because the way that a lot of these libraries are designed the way the pandas is designed and a lot of libraries that are similar to pandas it's the implementation and kind of the evaluate like the computational model like when computation happens like what are the semantics of the code that you're writing there's a lot of built-in assumptions around like the idea that data fits in memory and that you know when you write a plus B that like a plus B is evaluated immediately and materialized in memory and so if you want to scale out kind of scale up computing to data frame libraries you essentially have to re architect around the idea of deferred evaluation and essentially defining kind of a rich enough algebra or kind of intermediate representation of analytical computation where you can actually use a proper query engine or a query planner to execute operations and so really what is needed is to make libraries like pandas internally more like analytic databases and if you look at all the innovation that has happened in the analytic database world over the last 20 years we call them their databases and things that have happened in the Big Data world you know very little of that of that innovation in scalable 
data processing has made its way into the hands of data scientists so really you know one of my major goals with working my involvement in the Aero project is to provide the basis for collaboration between the database and analytic database world in the data science world which is just not something that's happened before ultimately the goal is to create like an embedded analytic database that is language independent and can be used in Python can be used an R that can work with much larger quantities of data but it's going to take like a different approach in terms of the user API because I think that this idea of like magically retrofitting pandas or essentially retrofitting pandas with the ability to work with hundreds of gigabytes of data or terabytes of data yeah I hate to say it's a little bit of a pipe dream I think it's going to require some breaking changes and some kind of some different approaches to the problem that's not to say that pandas is going away I mean pandas is not going anywhere and I think is certainly is occupying the the sweet spot of being like the ultimate Swiss Army knife for data sets under few gigabytes so does this conversation relate to the murmurings we've heard of potential pandas to in the pipeline yeah so we at the end of 2015 I started a discussion in the pandas community and so just FYI I think people you know are often thinking why ICC people out in the community of the like west you know thanks so much for pandas I have to remind them like to go out of your way and thank Geoffrey back and yours vandenBosch and phil cloud and tom alex burger and the you know the other pandas core developers that have really been driving the project forward over the last five years I haven't been very involved in the day-to-day developments and sometime in 2013 but at the end of 2015 I started spending some more time with the pandas developers said been building this project for its 7 year you know it's a little over 7 years old the code base are there things that we would like to fix like what are we gonna do about the performance and memory use and scalability issues I can't remember I don't think at that point I don't know that desk data frame existed and so desk has provided kind of an alternative route to scaling pandas by using pandas kind of as is but essentially re-implementing pandas operations using a desk computation graph but looking at the kind of single node scale kind of the in-memory side of pandas we sort of looked at you know what we'd like to fix about the pandas internals and that was what we know we described as the kind of pandas to initiative and around that time we were just getting ready to kick off the Apache arrow project and so I wouldn't say that we you know we reached kind of like a fully baked you know game plan in terms of how to create a quote-unquote pandas too but I think we reached some consensus that we would like to build a evolve data frame library that is a lot simpler in its functionality so shedding some of the baggage of multi indexes and some of the things in pandas that are can be a bit complex and also don't lend themselves very well to out of core you know on like very large not don't fit into memory datasets but something that's focused on dealing with the very large data sets at a single node scale so large out of core just big data sets on a laptop so we are I mean we are working on that and you know I think the project itself is not going to be called handus - just to kind of not confuse people and the pandas project 
is we all got together the pandas team we all got together in Austin over the summer and this is one of the topics that you know we're gonna continue to grow and kind of innovate and evolve the current pandas project kind of not as it is right now but my goal is to grow a parallel kind of companion project which is powered by the Apache aero ecosystem and provides the pandas light user experience in terms of usability and functionality but is really focused on kind of powering through very large on-disc datasets we'll jump right back into our interview with where this is after a short segment let's now dive into a segment called studies in interpretability I'm here with Patrick oil machine learning engineer and one of the core developers of the open source statistical modeling platform pie mc3 great to have you on the show patter thanks for having me here so we're here to talk about interpretability in building machine learning models and in data science more generally interpretability is telling you why our model makes certain decisions and this is important but it's more important in some areas than others right I mean it'll be more important in insurance and health care for example than in ad tech a space that you've worked in yes he go it's fair to say that interpretive ility matter is less in our tech the cost of showing a wrong ID is very different to say the cost of mispricing insurance policy can you speak to this a bit more from your perspective yeah so an odd tech the bottles I worked on largely involved lots of clever feature engineering and the deployed models were really logistic regression due to the fact that they are easy to paralyze so in our tech we care more about things like predictive accuracy because that's tied directly to the economic impact we don't care as much about explaining the model to internal customers or regulators I mostly agree however you could imagine an algorithm that shows wealthy teenagers ads for colleges but shows minorities ads for bail bondsmen having said that in finance and insurance being able to explain models matters a lot right right the course of a mistake in credit risk models is very high you've loaned for example to a customer or client who defaults I think as we see more applications of AI or ml and surance healthcare and other regulated industries we need to be more mindful of that so can you comment on some work you've seen in those industries well sadly some of the best work I've seen has been under a nondisclosure agreement or NDA one example I saw a model for protecting credit risk for loans the model itself was a random force would lie him on top of it for those of you who don't know lime sounds for locally interpret will model agnostic explanations and lion is basically a toolbox that allows you to get explainable outputs from your black box models in the credit risk model case it was easy to build a framework for handling customer requests in regards why they were flagged up as at risk of default and I was able to convert that information into actionable information such as pay off your credit card debt or pay off your student loans and how about in insurance well basically insurance companies have to allocate reserve capital to compensate for future losses there's a lot of historical work in actuarial community was quite mathematically basic mikuni leveraged newer techniques and using programming languages like orrin stan to compete a loss ratio that is a total mind that will be lost by the insurance company the future claims 
this is a great example of where a better modeling approach can help you better understand your risk a more complicated model is worth in this case since the use case involves so much risk and so much capital this model was interpreted for example one could see the naturally incorporated uncertainty in the posterior distribution business knowledge was also incorporated furthermore there was increased confidence in the model and the explicit statement of assumptions improved interpretability therefore is clear that this modeling approach can be a useful addition to your toolbox and also can provide insights the traditional machine learning methods can't provide for users who'd love to learn more with these Bayesian techniques then I recommend mix resources search for loss curves case study understand case studies website or my course on probabilistic programming is also excellent that's called probabilistic programming primer thanks for speaking today you patter anything I can do to help the listeners time to get straight back into our chat with Wes McKinney I'd like to step back a bit and think about open-source software development in general I suppose spoiler alert where I want this to go is to talk about your one of your latest ventures or labs but I'm wondering in your mind what the biggest challenges for open-source software development are at this point in time well we can have a whole podcast just about this topic and of course it depends on the stage of a project and all of these the way that I frame the problem when I talk to people is that I think open-source projects face you know funding and sustainability problems of different kinds depending on the stage of the project so I think in the early stages of projects when you're building something new or you're essentially solving a known problem in a different way it can be hard to get support from other developers or financial support to sponsor individuals to work on the project because it's hard to build consensus around something new and there might be like even competing approaches to the same problem and so we're talking about the kind of funding that can support full-time software developers you know it can be a lot of money and so committing a lot of money to support a risky venture into kind of building a new open-source project which may or may not come become successful can be a tough pill to swallow for potential financial backer later on as projects become you know wider adopted they start becoming particularly projects that are so they're foundational and you can call them like I think the popular term is like open source infrastructure there was a report so Nadia Iqbal wrote the report called roads and bridges about kind of open source infrastructure with the Ford Foundation and sort of is about this idea of like thinking about open source software is like a public good roads and bridges and like public infrastructure that everyone uses and you know with public infrastructure it's great because it's supported by tax tumblers but we don't exactly have a open source tax I'm you know I could get behind one but you know we don't have that kind of same kind of mentality around funding critical open-source infrastructure and I think that as projects become really successful and they become something that people can't live without they end up facing the classic tragedy of the Commons problem where people feel like well you know they derive some they derive a lot of value from the project but because everyone used this is a 
project they don't want to foot the bill of supporting and maintaining the software project so whether you're on the early side of a project or the late you know in the early stage or a late stage I think there's different kinds of funding and sustainability challenges and in all cases I think open-source developers and particularly as projects become more successful you end up quite over burdened and you know burnout risk and I know I've I've experienced a burnout many times and many other open-source developers have have experienced periods of significant burnout so what can listeners who are you know working or aspiring data scientists or data analysts in organizations or c-level people within organizations do for the open sort what would you like to see them do more for the open source well I think I think users and other folks can help with so as people like me I guess I've recently been kind of you know working on putting myself in a situation where I am able to raise money and put to work money that is donated for direct open-source development and so and I think so the best way a lot of people can help is by selling the idea of supporting and either through development work or through direct funding supporting the open source projects that you rely on so I think a lot of companies and a lot of developers are our passive participants in open source projects and so finding a way to contribute whether it's through money or time it is difficult because many open source projects particularly ones that are systems related to infrastructure they don't necessarily lend themselves to casual quote unquote casual contributions so if it's your 5% project or your 20% project it can be hard as an individual to make a meaningful contribution to a project which may have a steep learning curve or just require a lot of intense focus and so I think for a lot of organizations the best way to help projects can be to to donate money directly so I think something this provides a nice segue into your work at Mercer labs I'd love you to just give us a rundown of ursa labs in particular how it frames you know the challenges of open source software development yeah so server so labs is an organization I partnered with Hadley Wycombe from the art community and in our studio to found ursa labs earlier earlier this year the the kind of resin d'etre of versa labs was to to build shared infrastructure for data science in particular building out the aero ecosystem as a the apache aero ecosystem as it relates to to data science and making sure that we have high quality consistent support for all of that new technology in the python and our world and and beyond and improving interoperability for data scientists that use all those programming languages but the particularly gist achill details of versa labs is that we wanted to be able to effectively put together an industry consortium type model where we can raise money from corporations and use that money to hire full-time open-source developers so at the moment you know sorsa labs is being being supported by our studio I to Sigma where I used to work right up until the founding of ursa labs and it's now being funded by Nvidia the makers of graphics cards and so we're you know kind of act working actively on bringing in more you know sponsors to build a larger team of developers and I think it's really confronting that challenge that I think for an engineer at a company as a part time contributor to an open-source project may not be as effective or nearly as 
And it's really about confronting the challenge that an engineer at a company, as a part-time contributor to an open-source project, may not be as effective, or nearly as effective, as a full-time developer. So I want to make sure I'm able to build an organization full of outstanding engineers working full-time on open source software, and to do that in a scalable and sustainable way that is organized for the benefit of the open source data science world. Having been through the consulting path, the startup path, and working for single companies, I think a consortium model, where the work is funded by multiple organizations and where we're not building a product of some kind, is a new model for doing open source development, but one that I'm excited to pursue and see how things go.

Yeah, I think it's really exciting as well, because it does approach a lot of the different challenges. One in particular is the common problem of developers being employed by organizations and given a certain amount of time to work on open source software, but having that time eaten away because of different incentives within the organization.

Yeah, I mean, there have been a ton of contributions to pandas and to Apache Arrow from developers who work at corporations, and those contributions mean a lot, so we're definitely still looking for companies to collaborate on the roadmap and to work together to build new computational infrastructure for data science. But it's tough when a developer shows up and spends a lot of time for a month or two, and then, based on their priorities at the company where they work, they might disappear for six months. That's just the nature of things. The kinds of developers who make big contributions to open source can often be more senior, or tend to be very important developers in their respective organizations, so they frequently get called in to prioritize closed source or internal projects. That's just the ebb and flow of a corporate environment.

So I've got a relatively general question for you: what does the future of data science tooling look like to you?

Well, it's speculative of course, but as you can tell from how I spend my time on the Arrow project, my objective, and what I would like to see happen in data science tooling, is a defragmenting of data and code — increased standardization and adoption of open standards like the Arrow columnar format, storage formats like Parquet, and protocols for messaging like gRPC. So I believe that in the future things will be a lot more standardized and a lot less fragmented. And — kind of a slightly crazy idea, though I don't know how crazy it is — I also think that in the future programming languages are going to diminish in importance relative to data itself and common computational libraries. This is kind of a self-serving opinion, but I do think that if you're able to leave data in place and choose the user interface — namely, the programming language that best suits your needs in terms of interactivity or software development or so forth — then you can use multiple programming languages to build an application, or pick the programming language that you prefer, while utilizing common libraries of algorithms and common query engines for processing that data. I think we're beginning to see murmurings of this defragmentation happening, and I think the Arrow project is helping kick along this process and socialize the idea of what a more defragmented and more consistent user experience for a data scientist might look like.

That's a very exciting future.
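As a concrete, hypothetical illustration of the "leave the data in place" idea Wes describes, here is a minimal sketch using pyarrow: a pandas DataFrame is converted to an Arrow table and written to Parquet, where R's arrow package, Spark, or a query engine could read it without a bespoke connector. The file name and data are made up for the example; this is one possible pattern, not Ursa Labs code.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A hypothetical pandas DataFrame we want other tools to consume
# without writing custom converters.
df = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "GOOG"],
    "price": [170.3, 311.2, 135.7],
})

# Convert to an Arrow table: a language-agnostic, columnar,
# in-memory representation.
table = pa.Table.from_pandas(df)

# Persist as Parquet; any Arrow-aware system can read this file directly.
pq.write_table(table, "prices.parquet")

# Round-trip back into pandas to show the data survives unchanged.
roundtrip = pq.read_table("prices.parquet").to_pandas()
print(roundtrip)

The same bytes on disk, or the same Arrow buffers in memory, can be handed to another language's runtime without serialization to an intermediate format, which is the interoperability story discussed above.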
So my last question for you is: do you have a final call to action for our listeners out there?

Yeah, my call to action would be to find some meaningful way to contribute to the open source world, whether it's sharing your ideas, or sharing your use cases about what parts of the open source stack are working well for you or what parts you think could serve you better. If you're able to contribute to projects, whether through discussions on mailing lists or GitHub, or by commenting on the roadmap and so forth, that's all very valuable. A lot of people think code is the only real way to contribute to open source projects, but actually a lot of my time is spent not writing code but reviewing code and steering discussions about design, roadmap, and future scope, and I think the more voices and the more people involved to help build consensus and prioritize the work happening in open source projects, the healthier and more productive the communities become. And if you work in an organization that has the ability to donate money to open-source projects, I would love to see corporations worldwide effectively tithing a portion of their profits to fund open source infrastructure. I think if corporations gave a fraction of one percent of their profits to open source projects, the funding and sustainability crisis that we have now would essentially go away. I guess that might be a lot to ask, but I can always hope, and corporations can lead by example: if you do donate money to open source projects, you should make a show of that and make sure other corporations know that you're a good citizen helping support the work of open source developers.

I couldn't agree more. Wes, it's been an absolute pleasure having you on the show.

Thanks, Hugo, it's been fun.

Thanks for joining our conversation with Wes about pandas, data analysis tooling in general, the future of data science, and the challenges of open source software development. Wes stated that he thinks programming languages are going to diminish in importance relative to data itself and common computational libraries, and his work on Apache Arrow is central to this vision: the concept of portable data structures that are accessible from a variety of programming languages and that can leverage the vast computational power we now have to work in the limit of at least hundreds of gigabytes. Many popular data science tools, pandas among them, do not effectively leverage modern hardware, and one of Ursa Labs' goals is to empower and accelerate the work of data scientists through more efficient and scalable in-memory computing.
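For a flavor of what working in that larger-than-memory limit can look like today, here is a small, hypothetical sketch using pyarrow's dataset API to stream a directory of Parquet files that would not fit in RAM. The directory name and column are assumptions for the example, and this is one possible pattern rather than Ursa Labs' implementation.

import pyarrow.compute as pc
import pyarrow.dataset as ds

# Hypothetical directory of Parquet files, far larger than available RAM.
dataset = ds.dataset("trades/", format="parquet")

# Stream record batches instead of materializing everything at once;
# only the projected "price" column is read, keeping memory use bounded.
total, count = 0.0, 0
for batch in dataset.to_batches(columns=["price"]):
    if batch.num_rows == 0:
        continue
    total += pc.sum(batch.column(0)).as_py()
    count += batch.num_rows

print("mean price:", total / count if count else float("nan"))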
We also discussed Ursa Labs, which I am so excited about, and how its model of open-source software development is an industry consortium: raising money from corporations and using that money to hire full-time open source developers.

Also, get ready for next week's episode, our 2018 season one finale: a conversation with Cathy O'Neil — data scientist, investigative journalist, consultant, algorithmic auditor, and author of the critically acclaimed book Weapons of Math Destruction. Cathy and I will discuss the ingredients that make up weapons of math destruction, which are algorithms and models that are important in society, secret, and harmful — from models that decide whether you keep your job or get a credit card or insurance, to algorithms that decide how we're policed, sentenced to prison, or given parole. Cathy and I will be discussing the current lack of fairness in artificial intelligence, how societal biases are perpetuated by algorithms, and how both transparency and auditability of algorithms will be necessary for a fairer future. What does this mean in practice? Join us next week to find out. As Cathy says, fairness is a statistical concept, a notion that we need to understand at an aggregate level; and moreover, data science doesn't just predict the future, it causes the future. You'd best tune in for this, our final episode of season one of DataFramed — I didn't intend for that to sound threatening, but it may have come across that way. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp, and you can find all our episodes and show notes at datacamp.com/community/podcast.
agreement or NDA one example I saw a model for protecting credit risk for loans the model itself was a random force would lie him on top of it for those of you who don't know lime sounds for locally interpret will model agnostic explanations and lion is basically a toolbox that allows you to get explainable outputs from your black box models in the credit risk model case it was easy to build a framework for handling customer requests in regards why they were flagged up as at risk of default and I was able to convert that information into actionable information such as pay off your credit card debt or pay off your student loans and how about in insurance well basically insurance companies have to allocate reserve capital to compensate for future losses there's a lot of historical work in actuarial community was quite mathematically basic mikuni leveraged newer techniques and using programming languages like orrin stan to compete a loss ratio that is a total mind that will be lost by the insurance company the future claims this is a great example of where a better modeling approach can help you better understand your risk a more complicated model is worth in this case since the use case involves so much risk and so much capital this model was interpreted for example one could see the naturally incorporated uncertainty in the posterior distribution business knowledge was also incorporated furthermore there was increased confidence in the model and the explicit statement of assumptions improved interpretability therefore is clear that this modeling approach can be a useful addition to your toolbox and also can provide insights the traditional machine learning methods can't provide for users who'd love to learn more with these Bayesian techniques then I recommend mix resources search for loss curves case study understand case studies website or my course on probabilistic programming is also excellent that's called probabilistic programming primer thanks for speaking today you patter anything I can do to help the listeners time to get straight back into our chat with Wes McKinney I'd like to step back a bit and think about open-source software development in general I suppose spoiler alert where I want this to go is to talk about your one of your latest ventures or labs but I'm wondering in your mind what the biggest challenges for open-source software development are at this point in time well we can have a whole podcast just about this topic and of course it depends on the stage of a project and all of these the way that I frame the problem when I talk to people is that I think open-source projects face you know funding and sustainability problems of different kinds depending on the stage of the project so I think in the early stages of projects when you're building something new or you're essentially solving a known problem in a different way it can be hard to get support from other developers or financial support to sponsor individuals to work on the project because it's hard to build consensus around something new and there might be like even competing approaches to the same problem and so we're talking about the kind of funding that can support full-time software developers you know it can be a lot of money and so committing a lot of money to support a risky venture into kind of building a new open-source project which may or may not come become successful can be a tough pill to swallow for potential financial backer later on as projects become you know wider adopted they start becoming 
particularly projects that are so they're foundational and you can call them like I think the popular term is like open source infrastructure there was a report so Nadia Iqbal wrote the report called roads and bridges about kind of open source infrastructure with the Ford Foundation and sort of is about this idea of like thinking about open source software is like a public good roads and bridges and like public infrastructure that everyone uses and you know with public infrastructure it's great because it's supported by tax tumblers but we don't exactly have a open source tax I'm you know I could get behind one but you know we don't have that kind of same kind of mentality around funding critical open-source infrastructure and I think that as projects become really successful and they become something that people can't live without they end up facing the classic tragedy of the Commons problem where people feel like well you know they derive some they derive a lot of value from the project but because everyone used this is a project they don't want to foot the bill of supporting and maintaining the software project so whether you're on the early side of a project or the late you know in the early stage or a late stage I think there's different kinds of funding and sustainability challenges and in all cases I think open-source developers and particularly as projects become more successful you end up quite over burdened and you know burnout risk and I know I've I've experienced a burnout many times and many other open-source developers have have experienced periods of significant burnout so what can listeners who are you know working or aspiring data scientists or data analysts in organizations or c-level people within organizations do for the open sort what would you like to see them do more for the open source well I think I think users and other folks can help with so as people like me I guess I've recently been kind of you know working on putting myself in a situation where I am able to raise money and put to work money that is donated for direct open-source development and so and I think so the best way a lot of people can help is by selling the idea of supporting and either through development work or through direct funding supporting the open source projects that you rely on so I think a lot of companies and a lot of developers are our passive participants in open source projects and so finding a way to contribute whether it's through money or time it is difficult because many open source projects particularly ones that are systems related to infrastructure they don't necessarily lend themselves to casual quote unquote casual contributions so if it's your 5% project or your 20% project it can be hard as an individual to make a meaningful contribution to a project which may have a steep learning curve or just require a lot of intense focus and so I think for a lot of organizations the best way to help projects can be to to donate money directly so I think something this provides a nice segue into your work at Mercer labs I'd love you to just give us a rundown of ursa labs in particular how it frames you know the challenges of open source software development yeah so server so labs is an organization I partnered with Hadley Wycombe from the art community and in our studio to found ursa labs earlier earlier this year the the kind of resin d'etre of versa labs was to to build shared infrastructure for data science in particular building out the aero ecosystem as a the apache aero ecosystem 
as it relates to to data science and making sure that we have high quality consistent support for all of that new technology in the python and our world and and beyond and improving interoperability for data scientists that use all those programming languages but the particularly gist achill details of versa labs is that we wanted to be able to effectively put together an industry consortium type model where we can raise money from corporations and use that money to hire full-time open-source developers so at the moment you know sorsa labs is being being supported by our studio I to Sigma where I used to work right up until the founding of ursa labs and it's now being funded by Nvidia the makers of graphics cards and so we're you know kind of act working actively on bringing in more you know sponsors to build a larger team of developers and I think it's really confronting that challenge that I think for an engineer at a company as a part time contributor to an open-source project may not be as effective or nearly as effective as a full-time developer and so I want to make sure I'm able to build an organization that is full of outstanding engineers who are working full-time on open source software and making sure that we were able to do that in a scalable and sustainable way and is kind of organized for the benefit of the open source data science world so anyway and I having been through the consulting path and the startup path and working four single companies I think a consortium type model where it's being funded by multiple organizations and where we we're not building a product of some kind it's kind of a new model for doing open source development but one that I'm excited to pursue and see things go yeah I think it's really exciting as well because it does approach a lot of the different challenges one in particular it's a trochee it's a common problem right of developers being employed by organizations and being given a certain amount of time to work on open source software development but that time being eaten away because of different incentives within organization essentially yeah I mean it you know I think there have been ton of contributions to pandas into Apache aero from developers that work at corporations and those contributions mean a lot so definitely still looking for companies to collaborate on the roadmap and to work together to build kind of new computational infrastructure for data science you know I think it's tough when you know the developer might show up and be spending a lot of time for a month or two and then based on their priorities within where the company where they work they might disappear for six months and that's just the nature of things you know I think the kinds of developers that make big contributions to open-source can often be more senior or like gonna be very important developers and their respective organizations and so frequently gets kind of called in to kind of prioritize close source or internal projects that's just kind of you know the ebb and flow of corporate environment so I've got a relatively general question for you what does the future of data science tooling look like to you well speculative of course but you know I think by spending my time on arrow project you know my objective and what I would like to see happen in data science tooling is a defragmenting of of data and code so to have increased standardization and an adoption of open standards like the arrow columnar format storage formats like park' and or protocols for 
So I think that in the future things will be a lot more standardized and a lot less fragmented. A slightly crazy idea, and I don't know how crazy it is, but I also believe that in the future programming languages are going to diminish in importance relative to data itself and to common computational libraries. This is a somewhat self-serving opinion, but I do think that if you're able to leave data in place and choose the user interface, namely the programming language that best suits your needs in terms of interactivity or software development and so forth, then you can use multiple programming languages to build an application, or pick the programming language you prefer, while utilizing common libraries of algorithms and common query engines for processing that data. We're beginning to see murmurings of this defragmentation happening, and I think the Arrow project can help kick along this process and socialize the idea of what a more defragmented and more consistent user experience for a data scientist might look like.

That's a very exciting future. So my last question for you: do you have a final call to action for our listeners out there?

Yeah, my call to action would be to find some meaningful way to contribute to the open source world, whether it's sharing your ideas or sharing your use cases about which parts of the open source stack are working well for you and which parts you think could serve you better. If you are able to contribute to projects, whether through discussions on mailing lists or GitHub, or commenting on the roadmap and so forth, that's all very valuable. A lot of people think code is the only real way to contribute to open source projects, but actually a lot of my time is spent not writing code but reviewing code and steering discussions about design, roadmap, and future scope. The more voices and the more people involved to help build consensus and help prioritize the work happening in open source projects, the healthier and more productive those communities become. And if you work in an organization that has the ability to donate money to open source projects, I would love to see corporations worldwide effectively tithing a portion of their profits to fund open source infrastructure. I think if corporations gave a fraction of one percent of their profits to open source projects, the funding and sustainability crisis that we have now would essentially go away. That might be a lot to ask, but I can always hope, and corporations can lead by example: if you do donate money to open source projects, make a show of it and make sure other corporations know that you're a good citizen helping to support the work of open source developers.

I couldn't agree more. Wes, it's been an absolute pleasure having you on the show.

Thank you, it was fun.

Thanks for joining our conversation with Wes about pandas, data analysis tooling in general, the future of data science, and the challenges of open source software development. Wes stated that he thinks that in the future programming languages are going to diminish in importance relative to data itself and common computational libraries, and his work on Apache Arrow is central to this vision: the concept of portable data structures that are accessible from a variety of programming languages and that can leverage the vast computational power we now have to work with data at the scale of at least hundreds of gigabytes.
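As a small editorial aside, here is one way the "portable data structures" idea can look in practice, again an illustrative sketch using pyarrow rather than anything demonstrated in the episode; the file and column names are made up.

```python
# Editorial illustration (not from the episode): share a table through the
# Arrow IPC file format and read it back via a memory map, so the columnar
# buffers are referenced in place instead of being copied or re-serialized.
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"station": ["a", "b", "c"], "temp_c": [21.5, 19.0, 23.2]})

# Write the table in the language-agnostic Arrow IPC file format.
writer = ipc.new_file("readings.arrow", table.schema)
writer.write_table(table)
writer.close()

# Memory-map the file and reconstruct the table; record batch data is read
# from the mapped region rather than copied into fresh buffers.
with pa.memory_map("readings.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()

print(loaded.equals(table))
```

The same file can be opened from any Arrow-capable runtime, which is one concrete form of "leaving data in place and choosing your interface."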
Many popular data science tools, pandas included, do not effectively leverage modern hardware, and one of Ursa Labs' goals is to empower and accelerate the work of data scientists through more efficient and scalable in-memory computing. We also discussed Ursa Labs itself, which I am so excited about, and how the model of open source software development there is to put together an industry consortium type model, raising money from corporations and using that money to hire full-time open source developers.

Also, get ready for next week's episode, our 2018 season one finale: a conversation with Cathy O'Neil, data scientist, investigative journalist, consultant, algorithmic auditor, and author of the critically acclaimed book Weapons of Math Destruction. Cathy and I will discuss the ingredients that make up weapons of math destruction, which are algorithms and models that are important in society, secret, and harmful, from models that decide whether you keep your job, your credit card, or your insurance, to algorithms that decide how we're policed, sentenced to prison, or given parole. We'll be discussing the current lack of fairness in artificial intelligence, how societal biases are perpetuated by algorithms, and how both transparency and auditability of algorithms will be necessary for a fairer future. What does this mean in practice? Join us next week to find out. As Cathy says, fairness is a statistical concept, a notion that we need to understand at an aggregate level, and moreover, data science doesn't just predict the future, it causes the future. You'd best tune in for this, our final episode of season one of DataFramed. I didn't intend for that to sound threatening, but it came across that way. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at hugobowne and DataCamp at datacamp, and you can find all our episodes and show notes at datacamp.com/community/podcast.