Here’s an interesting article on how to represent a categorical feature with hundreds of levels in an R model.

In this post, we will discuss using an embedding matrix as an alternative to one-hot encoded categorical features in modeling. We usually find references to embedding matrices in natural language processing applications, but they may also be used on tabular data. An embedding matrix replaces the sparse one-hot encoded matrix with an array of vectors, where each vector represents one level of the feature. Using an embedding matrix can greatly reduce the memory needed to handle categorical features.
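To make the idea concrete, here is a minimal sketch in plain Python (not code from the linked article). It assumes a hypothetical feature with 500 levels and an embedding width of 8, and it fills the embedding matrix with random numbers; in a real model those vectors would be learned during training. It shows that looking up a row of the embedding matrix gives the same result as multiplying the one-hot vector by that matrix, while each observation needs far fewer stored values.

```python
import random


def one_hot(index, num_levels):
    """Return a one-hot row vector for the given level index."""
    row = [0.0] * num_levels
    row[index] = 1.0
    return row


def matvec(row, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    num_cols = len(matrix[0])
    return [
        sum(row[i] * matrix[i][j] for i in range(len(row)))
        for j in range(num_cols)
    ]


def make_embedding(num_levels, dim, seed=0):
    """Build a random embedding matrix (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for _ in range(num_levels)
    ]


def embed(indices, embedding):
    """Look up one embedding vector per observed level index."""
    return [embedding[i] for i in indices]


# Assumed sizes, chosen only for illustration.
num_levels, dim = 500, 8
embedding = make_embedding(num_levels, dim)

# Four observations of the categorical feature, encoded as level indices.
observations = [3, 499, 3, 42]

# Direct lookup versus the equivalent one-hot matrix product.
dense = embed(observations, embedding)
via_one_hot = [matvec(one_hot(i, num_levels), embedding) for i in observations]

# Values stored per batch of observations under each representation.
one_hot_cells = len(observations) * num_levels
embedded_cells = len(observations) * dim
```

With these assumed sizes, the one-hot batch holds 2,000 values while the embedded batch holds 32. The embedding matrix itself still has 500 x 8 entries, but it is stored once rather than once per row of data.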

My latest MSDN article is now available.

This month, I explore R and the TidyVerse.

Loading Data with readr

The readr package provides a fast and easy way to read rectangular data files, such as .csv files. It can flexibly parse many types of data files, while handling errors robustly. To get started, create a new R language Jupyter Notebook. For details on Jupyter […]

Yes, you read the title correctly.

Keras and deep learning are not just for Pythonic people; R developers can play along, too. Here’s a great article on how to use Keras from R.

This talk introduces you to using Keras from within R, highlighting the packages and supporting tools (and some unique tools) that make R an excellent option for deep learning.

Here’s an interesting read on the four most important big data programming languages: Python, R, Scala, and Java. While debates over programming languages tend to devolve quickly into shouting matches, this article seems quite level-headed.

Programming languages, just like spoken languages, have their own unique structures, formats, and flows. While spoken languages are typically determined by geography, the use of programming languages is determined more by the coder’s preference, IT culture, and business objectives. When it comes to data science, there are four programming […]

In my latest column in MSDN Magazine, I explore R and what makes it a powerful and elegant language for exploring and manipulating data.

A robust developer community has emerged around R, with the most popular repository for R packages being the Comprehensive R Archive Network (CRAN). CRAN has packages that cover everything from Bayesian Accrual Prediction to Spectral Processing for High Resolution Flow Infusion Mass Spectrometry. A complete list of R packages available in CRAN is available online. Suffice it to say that R and CRAN provide robust tools for any data science or scientific research project.

Josh Gordon sits down with J.J. Allaire, the founder of RStudio. They discuss TensorFlow and Keras support in R, and the educational resources available for R developers new to deep learning. Learn more about the R interface to Keras, TensorFlow Estimators, and the Core TensorFlow API that allows the R community access to many machine learning tools.

Just when you thought Azure Databricks couldn’t get any better, watch this video where Yatharth Gupta, Principal Program Manager for Azure Databricks, talks about the newly introduced integration with RStudio.

For data scientists looking to scale out R-based computing to big data, Azure Databricks provides the best way to scale out their R models with Spark, one that is easy to set up and integrates with the most popular R tools and frameworks. Data scientists can use Azure Databricks and RStudio to easily create analytics models, quickly access and prepare high-quality data sets, and automatically run R workloads at unprecedented scale.