Databricks livestreamed this tech talk earlier today.

Developers and data scientists around the world have developed tens of thousands of open source projects to help track, understand, and address the spread of COVID-19. Given the sheer volume, finding a project to contribute to can prove challenging. To make this easier, we built a recommendation system that highlights projects based on the programming languages and keywords a user inputs.
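The talk doesn't spell out its scoring scheme, but the "languages plus keywords in, ranked projects out" idea can be sketched with simple set-overlap scoring. Everything below (the project list, field names, and the weighting) is an illustrative assumption, not the talk's actual system:

```python
# Hypothetical sketch: rank open source projects by overlap between a user's
# requested languages/keywords and each project's metadata.

def recommend(projects, languages, keywords, top_n=3):
    """Score each project by how many requested languages and keywords it matches."""
    query_langs = {l.lower() for l in languages}
    query_kw = {k.lower() for k in keywords}
    scored = []
    for p in projects:
        lang_hits = len(query_langs & {l.lower() for l in p["languages"]})
        kw_hits = len(query_kw & {k.lower() for k in p["keywords"]})
        score = 2 * lang_hits + kw_hits  # weight language matches higher (arbitrary choice)
        if score > 0:
            scored.append((score, p["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]

# Made-up catalog of COVID-19 projects for illustration.
projects = [
    {"name": "covid-dashboard", "languages": ["python"], "keywords": ["visualization", "tracking"]},
    {"name": "case-forecaster", "languages": ["r"], "keywords": ["modeling", "forecasting"]},
    {"name": "contact-tracer", "languages": ["python", "swift"], "keywords": ["tracking", "mobile"]},
]

print(recommend(projects, ["Python"], ["tracking"]))
```

A real system would likely use TF-IDF or embeddings over project READMEs rather than exact keyword matching, but the interface is the same: languages and keywords in, ranked project names out.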

This talk will go through the full cycle of implementing this system: gathering data, building/tracking models, deploying the model, and creating a UI to utilize the model.

Here’s an interesting interview with the team behind Julia, an up-and-coming language for data science and AI.

At the same time, Julia is general purpose, and provides facilities for creating dashboards, documentation, REST APIs, web applications, integration with databases, and much more. As a result, Julia is now seeing significant commercial adoption in a number of industries. Data scientists and engineers across industries not only use Julia to develop their models, but are able to deploy their programs to production with a single click using Julia Computing’s products.

Databricks talks about the latest developments and best practices for managing the full ML lifecycle on Databricks with MLflow.

Part 1: Opening Keynote and Demo

  • MLOps and ML Platforms State of the Industry, opening Keynote with Matei Zaharia, Co-founder and CTO at Databricks and Clemens Mewald, Director of Product Management at Databricks – https://youtu.be/9Ehh7Vl7ByM – Slideshare: https://www.slideshare.net/databricks/mlops-virtual-event-building-machine-learning-platforms-for-the-full-lifecycle
  • Operationalizing Data Science & ML on Databricks using MLflow (Demo) with Sean Owen, Principal Solution Architect at Databricks – https://youtu.be/cxAmu9w8BFo
  • Live Q&As – https://youtu.be/AQqqK5hRY5g

Resources:

Databricks hosted a four part learning series: Introduction to Data Analysis for Aspiring Data Scientists. This is the third of four online workshops for anyone and everyone interested in learning about data analysis. No previous programming experience required.

Part 3: Machine Learning with scikit-learn

If you want to join the live conversation on Zoom, follow the link on our online meetup: https://www.meetup.com/data-ai-online/events/269838467/

Abstract: scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners. This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the New York Times (https://github.com/nytimes/covid-19-data). Prior basic Python and pandas experience is required.

Who should attend this workshop: Anyone and everyone, CS students and even non-technical folks are welcome to join. Please note, prior basic Python experience is recommended.

What you need: Although no prep work is required, we do recommend basic Python knowledge.
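The workshop's core exercise, building and evaluating a simple model, follows scikit-learn's fit/predict/score pattern. Here is a minimal sketch with made-up toy data (the workshop itself uses the New York Times COVID-19 dataset):

```python
# Minimal scikit-learn workflow sketch: split, fit, predict, evaluate.
# The feature/label data below is invented for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy dataset: predict a binary label from two numeric features.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out a test set so evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
```

The same four steps (split, fit, predict, score) carry over unchanged when the toy lists are replaced by a real pandas DataFrame of case counts.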

Related links

While data science has advanced in recent years, it is still not a mature field with well-established patterns and practices.

As such, there isn’t a textbook answer for building a successful data science workflow.

Instead, data scientists undertaking new data science projects must consider the specificities of each project, past experiences, and personal preferences when setting up the source data, cleaning the data, modeling, monitoring, reporting and more.

While there’s no one-size-fits-all method for data science workflows, there are some best practices, like taking the time to set up auto-documentation processes and always conducting post-mortems after projects are completed to find areas ripe for improvement.

Databricks recently held a webinar on how they worked with Virgin Hyperloop One engineers.

They discuss the goals, implementation, and outcome of moving from Pandas code to Koalas code and using MLflow. Lots of code, notebooks, demos, etc.

Come hear Patryk Oleniuk, Software Engineer at Virgin Hyperloop (VHO), discuss how VHO has dramatically reduced processing time by 95% while changing less than 1% of its previously single-threaded, pandas-based Python code. Attendees of this webinar will learn:

  • How VHO leverages public and private transportation data to optimize Hyperloop design
  • How to ‘Sparkify’ (scale) your pandas code by using ‘Koalas’ with minimal code changes
  • How to use ‘Koalas’ and MLflow for sweeping machine learning models and experiment results

Featured speakers:

  • Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One
  • Yifan Cao, Senior Product Manager, Databricks

Resources:

Slides: https://www.slideshare.net/databricks/from-pandas-to-koalas-reducing-timetoinsight-for-virgin-hyperloops-data

Koalas Notebook: https://pages.databricks.com/rs/094-YMS-629/images/koalas_webinar_code%20-%20Copy.html
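The "change less than 1% of your code" claim rests on Koalas mirroring the pandas API. A rough sketch of the idea, with invented column names and plain pandas so it runs anywhere (the Koalas swap is shown in comments):

```python
# Single-threaded pandas code of the kind VHO started from.
import pandas as pd
# The talk's approach is (roughly) to replace the import above with
#   import databricks.koalas as ks   # then ks.DataFrame / ks.read_csv
# and leave the rest of the pandas-style code unchanged, so it runs on Spark.

# Invented example data: travel-time samples per route.
df = pd.DataFrame({
    "route": ["A", "A", "B", "B"],
    "travel_time_min": [31.0, 29.0, 45.0, 47.0],
})

# A typical transformation: per-route mean travel time.
summary = df.groupby("route")["travel_time_min"].mean().reset_index()
```

Because Koalas implements the same `groupby`/`mean`/`reset_index` surface on top of Spark DataFrames, this line runs distributed after the one-line import change, which is where the "less than 1% code changed, 95% faster" result comes from.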

This unprecedented lockdown is an opportunity to really dig in and work on data science projects. A lot of folks suddenly have time on their hands which they did not see coming.

Why not use this time to prepare for your dream data science role?

Overview:

  • Now is the ideal time to work on your data science portfolio with these open source projects
  • From datasets on COVID-19 to a collection of AutoML libraries by Google Brain, there are a lot of data science projects to learn from

Introduction: We are living in the midst of an unprecedented […]

Towards Data Science highlights this talk from the Toronto Machine Learning Summit, which introduces differential privacy and its use cases, discusses the new component of the TensorFlow Privacy library, and offers real-world scenarios for how to apply the tools.

In recent years, the world has become increasingly data-driven and individuals and organizations have developed a stronger awareness and concern for the privacy of their sensitive data. It has been shown that it is impossible to disclose statistical results about a private database without revealing some information. In fact, the entire database could be recovered from a few query results. Following research on the privacy of sensitive databases, a number of big players such as Google, Apple, and Uber have turned to differential privacy to help guarantee the privacy of sensitive data.
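The standard building block behind such deployments is the Laplace mechanism: add noise calibrated to a query's sensitivity before releasing a statistic. A minimal sketch, with function names of my own invention (TensorFlow Privacy provides the production-grade tooling the talk actually covers):

```python
# Minimal Laplace-mechanism sketch for epsilon-differential privacy.
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via inverse-CDF on a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng=None):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Smaller epsilon means more noise and stronger privacy.
noisy = private_count(1000, epsilon=0.5, rng=random.Random(0))
```

This is exactly the countermeasure to the reconstruction attacks mentioned above: because each query answer is noisy, an attacker cannot combine a few exact query results to recover individual rows of the database.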