It might seem paradoxical to want to “sell” open source, but lo these many years into the open source revolution, many organizations have yet to get the message. 

Press the play button below to listen here, or visit the show page.

Listen in as James Genus helps make sense of it all, giving us the insights we need to convince the laggards to get on board.

ABOUT: Microsoft Principal Cloud Solution Architect James Genus Jr. has been working in the technology field for over 20 years. Across multiple industries, including manufacturing, construction, research, and technology, open source solutions have been his trusted companions, helping to solve technical and business challenges.

LINKS: OpenSource.net, Open Source Blog (Microsoft), Open Source Initiative

CREDITS: Louis Berman (Host); James Genus (Guest); Dan Phillipson / PremiumBeat (Music); Anne Lamb (Intro/Outro); East Coast Studio (Editing)

MORE: Visit https://azure-success.com for additional episodes, transcripts, and more ways to listen to the show. For comments and suggestions, please feel free to email your host, Louis Berman, at lberman@microsoft.com.

There was once a time when folks pondered whether or not open source would be a viable business model.

Today, that sounds comical, as there are numerous open-source tech companies, some of which have surpassed $100 million (or even $1 billion) in annual revenue, including Red Hat, MongoDB, Cloudera, MuleSoft, HashiCorp, Databricks (Spark), and Confluent (Kafka).

Why do tech companies open source their products?

“Open-source is an enabler of innovation, giving organisations access to a global pool of talent and the tools to develop secure, reliable and scalable software – fast. The organisations that are most effectively speeding up business transformation are those who have turned to open-source software development to succeed in a fast-changing, digital world,” said Maneesh Sharma, General Manager of GitHub India, in an interview with Analytics India Magazine.

Databricks, the company behind the commercial development of Apache Spark, is placing its machine learning lifecycle project MLflow under the stewardship of the Linux Foundation.

MLflow provides a programmatic way to deal with all the pieces of a machine learning project through all its phases — construction, training, fine-tuning, deployment, management, and revision. It tracks and manages the datasets, model instances, model parameters, and algorithms used in machine learning projects, so they can be versioned, stored in a central repository, and repackaged easily for reuse by other data scientists.
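
As a rough sketch of what that looks like in practice, here’s a minimal MLflow tracking example, assuming scikit-learn is installed; the model choice and hyperparameter values below are made up for illustration.

```python
# Minimal MLflow tracking sketch (illustrative; the model and
# hyperparameter values below are made-up examples).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run():
    # Log the hyperparameters used for this training run.
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)

    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)

    # Log a metric so runs can be compared in the tracking UI.
    mlflow.log_metric("train_r2", model.score(X, y))

    # Store the fitted model as a versioned artifact for reuse.
    mlflow.sklearn.log_model(model, "model")
```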

Here’s an interesting and thoughtful look at the role open source plays in AI development.

Think about it: just about every major AI tool is open source, with TensorFlow being the most obvious example.

The biggest advantage of open-source AI systems is that they involve no licensing fees, which is especially useful for people with little or no experience in IT infrastructure. But what’s free is not always free: although open-source AI comes without a license fee, it carries hidden costs, just as commercial software does, such as training, implementation, and maintenance. That said, open-source AI can reduce these costs as well.

Here’s an interesting talk on Dask from AnacondaCon 2018.

Tom Augspurger: Scikit-Learn, NumPy, and pandas form a great toolkit for single-machine, in-memory analytics. Scaling them to larger datasets can be difficult, as you have to adjust your workflow to use chunking or incremental learners. Dask provides NumPy- and pandas-like data containers for manipulating larger-than-memory datasets, and dask-ml provides estimators and utilities for modeling larger-than-memory datasets.

These tools scale your usual workflow out to larger datasets. We’ll discuss some of the challenges data scientists run into when scaling out to larger datasets. We’ll then focus on demonstrations of how dask and dask-ml solve those challenges. We’ll see examples of how dask can expose a cluster of machines to scikit-learn’s built-in parallelization framework, and how dask-ml can train estimators on large datasets.

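To make that concrete, here’s a minimal sketch of the two patterns the abstract describes, assuming a running dask.distributed cluster (a local one in this case); the CSV file pattern and column names are hypothetical.

```python
# Illustrative sketch of scaling pandas/scikit-learn workflows with Dask
# (the "data-*.csv" pattern and the "key"/"value" columns are hypothetical).
import dask.dataframe as dd
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Connect to a cluster; with no arguments this starts a local one.
client = Client()

# A pandas-like container over a larger-than-memory dataset:
# operations stay lazy until .compute() is called.
df = dd.read_csv("data-*.csv")
print(df.groupby("key")["value"].mean().compute())

# Expose the cluster to scikit-learn's built-in (joblib) parallelism,
# so each grid-search fit runs on a Dask worker.
X, y = make_classification(n_samples=1_000, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3)
with joblib.parallel_backend("dask"):
    search.fit(X, y)
print(search.best_params_)
```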

Microsoft for Startups shares this highlight reel from the Spring MLADS conference.

In case you’re not familiar with MLADS, check out Data Driven’s coverage of the most recent one.

Twice a year, Microsoft assembles over 4,000 of our top data scientists and engineers for a two-day internal conference to explore the state of the art around machine learning and data science.

Earlier this year, 30 leading startups active in the Microsoft for Startups program came to showcase their solutions and engage directly with the engineering teams.

Talk about cold storage.

The GitHub Arctic Code Vault is a data repository preserved in the Arctic World Archive (AWA), a very-long-term archival facility 250 meters deep in the permafrost of an Arctic mountain. The archive is located in a decommissioned coal mine in the Svalbard archipelago, closer to the North Pole than the Arctic Circle. GitHub will capture a snapshot of every active public repository on 02/02/2020 and preserve that data in the Arctic Code Vault.