A colleague of mine, Ayman El-Ghazali, worked through data from the state of Maryland.

Code is available on GitHub.

I chose not to source my data directly from Maryland’s State Government site because the format was not easy to use. The official Maryland Government data has each day as a column and each ZIP code as a row — less convenient than the data provided by the site above. There may be a few day-to-day discrepancies between the two sources, but the totals are identical. You can read about their methodologies for retrieving data from the various official state government websites, and the quality of each.
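The wide layout described above (one column per day, one row per ZIP code) can be reshaped into a tidy long format with pandas. A minimal sketch, using made-up ZIP codes, dates, and counts for illustration:

```python
import pandas as pd

# Hypothetical wide-format data: one row per ZIP code, one column per day
wide = pd.DataFrame({
    "zip": ["21201", "21202"],
    "2020-04-01": [10, 5],
    "2020-04-02": [12, 7],
})

# Melt the day columns into (zip, date, cases) rows
long = wide.melt(id_vars="zip", var_name="date", value_name="cases")
long["date"] = pd.to_datetime(long["date"])
print(long.sort_values(["zip", "date"]).reset_index(drop=True))
```

Once in long format, each (ZIP, date) pair is a single row, which makes grouping and plotting over time straightforward.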

Databricks just livestreamed this tech talk earlier today.

Developers and data scientists around the world have developed tens of thousands of open source projects to help track, understand, and address the spread of COVID-19. Given the sheer volume, finding a project to contribute to can prove challenging. To make this easier, we built a recommendation system that highlights projects based on user-supplied programming languages and keywords.

This talk will go through the full cycle of implementing this system: gathering data, building/tracking models, deploying the model, and creating a UI to utilize the model.
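The talk walks through the real pipeline; the core matching idea — scoring a keyword query against project descriptions — can be sketched with scikit-learn's TF-IDF vectorizer and cosine similarity. The project names and descriptions below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical project catalog: name -> description (language + keywords)
projects = {
    "covid-tracker": "python dashboard tracking case counts by county",
    "mask-detector": "python computer vision model detecting face masks",
    "mobility-etl": "scala spark pipeline for mobility data ingestion",
}

names = list(projects)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(projects.values())

def recommend(query, top_n=2):
    """Rank projects by cosine similarity between the query and descriptions."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    ranked = sorted(zip(names, scores), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]

print(recommend("python tracking dashboard"))
```

A production version would add model tracking (e.g. with MLflow) and a UI on top, as the talk describes.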

Databricks just posted part 3 of a three-part online technical workshop series on Managing the Complete Machine Learning Lifecycle with MLflow. If you’re interested in learning about machine learning and MLflow, this workshop series is for you!

Details:

This workshop is an introduction to MLflow. Machine Learning (ML) development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.

To solve these challenges, MLflow, an open source project, simplifies the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, encapsulate models so they can be used with many existing tools, and share models through a central repository, accelerating the ML lifecycle for organizations of any size.

Related Links:

Databricks live streamed this interview with Matei Zaharia, an assistant professor at Stanford CS and co-founder and Chief Technologist of Databricks, the data and AI platform startup.

During his Ph.D., Matei started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing. He also co-created other widely used data and AI software such as MLflow, Apache Mesos, and Spark Streaming.

.NET for Apache Spark empowers developers with .NET experience or code bases to participate in the world of big data analytics.

In this episode, Brigit Murtaugh joins Rich to show us how to start processing data with .NET for Apache Spark.

Time index:

  • [01:01] – What is Apache Spark?
  • [02:33] – What are customers using Apache Spark for?
  • [03:50] – Why did we create .NET for Apache Spark?
  • [06:30] – Exploring GitHub data
  • [15:12] – Considering data processing in the real world
  • [18:26] – Analyzing continuous data streams

Useful Links

Databricks recently hosted this online tech talk on Delta Lake.

The General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both aim to guarantee strong protection for individuals regarding their personal data and apply to businesses that collect, use, or share consumer data, whether the information was obtained online or offline. Compliance remains a top priority for companies, and they are spending significant time and resources on meeting GDPR and CCPA requirements.

Your organization may manage hundreds of terabytes worth of personal information in your cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge, especially for larger datasets stored in data lakes.
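Conceptually, a "right to be forgotten" request means locating and deleting every row tied to a given individual — which is hard in a data lake where files are immutable; Delta Lake addresses this with a DELETE statement. As a toy illustration of the operation itself, shown with pandas rather than Delta Lake and with made-up column names:

```python
import pandas as pd

# Toy personal-data table; in practice this would be a large Delta table
users = pd.DataFrame({
    "user_id": [1, 2, 3, 2],
    "email": ["a@x.com", "b@x.com", "c@x.com", "b@x.com"],
})

# Honor a deletion request for user_id == 2
# (the Delta Lake equivalent: DELETE FROM users WHERE user_id = 2)
users = users[users["user_id"] != 2].reset_index(drop=True)
print(users)
```

The hard part at data-lake scale is not the filter itself but rewriting the underlying files transactionally, which is exactly what Delta Lake's DELETE handles.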

Databricks just streamed this workshop on managing the machine learning lifecycle with MLflow.

Workshop 1 of 3 | Introduction to MLflow: How to Use MLflow Tracking

Level: Beginner/Intermediate Data Scientist or ML Engineer

Details: This workshop is an introduction to MLflow. Machine Learning (ML) development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.

ThorogoodBI explores the use of Databricks for data engineering purposes in this webinar.

Whether you’re looking to transform and clean large volumes of data or collaborate with colleagues to build advanced analytics jobs that can be scaled and run automatically, Databricks offers a Unified Analytics Platform that promises to make your life easier.

In the second of two recorded webcasts, Thorogood consultants Jon Ward and Robbie Shaw showcase Databricks’ data transformation and data movement capabilities, show how the tool aligns with cloud computing services, and highlight the security, flexibility, and collaboration aspects of Databricks. We’ll also look at Databricks Delta Lake and how it offers improved storage for both large-scale datasets and real-time streaming data.

Databricks talks about the latest developments and best practices for managing the full ML lifecycle on Databricks with MLflow.

Part 1: Opening Keynote and Demo

  • MLOps and ML Platforms State of the Industry, opening Keynote with Matei Zaharia, Co-founder and CTO at Databricks and Clemens Mewald, Director of Product Management at Databricks – https://youtu.be/9Ehh7Vl7ByM – Slideshare: https://www.slideshare.net/databricks/mlops-virtual-event-building-machine-learning-platforms-for-the-full-lifecycle
  • Operationalizing Data Science & ML on Databricks using MLflow (Demo) with Sean Owen, Principal Solution Architect at Databricks – https://youtu.be/cxAmu9w8BFo
  • Live Q&As – https://youtu.be/AQqqK5hRY5g

Resources:

Databricks hosted this webinar introducing Apache Spark, the platform that Databricks is based upon.

Abstract: scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners.

This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the New York Times (https://github.com/nytimes/covid-19-data).

Prior basic Python and pandas experience is required.
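To give a flavor of what the workshop covers, here is a minimal scikit-learn example on synthetic data (not the New York Times dataset): fit a model on a training split, then evaluate it on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set so the evaluation reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", acc)
```

The train/test split is the key evaluation technique the workshop emphasizes: measuring accuracy on data the model never saw during fitting.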

Previous webinars in the series:

  • Watch Part 1, Intro to Python: https://youtu.be/HBVQAlv8MRQ
  • Watch Part 2, Data Analysis with pandas: https://youtu.be/riSgfbq3jpY
  • Watch Part 3, Machine Learning: https://youtu.be/g103iO-izoI