The most popular dataset on Kaggle is Credit Card Fraud Detection. It’s an easy-to-understand problem space that affects just about everyone, and fraud detection is a practical application that many businesses care about. There’s also something intrinsically cool about stopping crime with AI.

Here’s an interesting article on how to implement a fraud detection system with TensorFlow, PySpark, and Cortex.

While it would be cool to just build an accurate model, it would be more useful to build a production application that can automatically scale to handle more data, update when new data becomes available, and serve real-time predictions. This usually requires a lot of DevOps work, but we can do it with minimal effort using Cortex, an open-source machine learning infrastructure platform. Cortex converts declarative configuration into scalable machine learning pipelines. In this guide, we’ll see how to use Cortex to build and deploy a fraud detection API using Kaggle’s dataset.

Fraud detection, a common use of AI, belongs to a more general class of problems: anomaly detection.

An anomaly is a generic, not domain-specific, concept. It refers to any exceptional or unexpected event in the data: a mechanical piece failure, an arrhythmic heartbeat, or a fraudulent transaction.

Identifying fraud means identifying an anomaly within a set of otherwise legitimate transactions. As with all anomalies, you can never be sure what form a fraudulent transaction will take, so you need to account for all possible “unknown” forms.

Here’s an interesting article on doing anomaly/fraud detection with a neural autoencoder.

Using a training set of only legitimate transactions, we teach a machine learning algorithm to reproduce the feature vector of each transaction. Then we check how faithful each reproduction is. If the distance between the original transaction and the reproduced transaction is below a given threshold, the transaction is considered legitimate; otherwise it is considered a fraud candidate (generative approach). In this case, we only need a training set of “normal” transactions, and we suspect an anomaly whenever the distance value is large.
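The reconstruction-distance idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the article’s method: a PCA projection stands in for the neural autoencoder (it behaves like a linear autoencoder), the data is synthetic, and the function names and the 99th-percentile threshold are hypothetical choices.

```python
import numpy as np

def fit_linear_autoencoder(X_legit, n_components):
    """Fit a PCA-based stand-in for an autoencoder on legitimate data only."""
    mean = X_legit.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X_legit - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Euclidean distance between each row and its reconstruction."""
    centered = X - mean
    codes = centered @ components.T      # "encode" into the low-dimensional space
    reconstructed = codes @ components   # "decode" back to feature space
    return np.linalg.norm(centered - reconstructed, axis=1)

# Synthetic legitimate transactions (assumption): 5 features that lie
# close to a 2-D plane, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 5))
X_legit = latent @ mixing + 0.05 * rng.normal(size=(1000, 5))

mean, components = fit_linear_autoencoder(X_legit, n_components=2)

# Hypothetical threshold: the 99th percentile of training reconstruction errors.
threshold = np.quantile(reconstruction_error(X_legit, mean, components), 0.99)

def is_fraud_candidate(X):
    """True where the reconstruction distance exceeds the threshold."""
    return reconstruction_error(X, mean, components) > threshold

# A made-up suspicious transaction: a legitimate one pushed off the plane.
off_plane = np.linalg.svd(mixing)[2][-1]  # unit vector orthogonal to the plane
suspicious = X_legit[:1] + 5.0 * off_plane
print(is_fraud_candidate(suspicious))
```

The threshold is the main tuning knob here: a lower percentile flags more fraud candidates at the cost of more false alarms on legitimate transactions.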

Alternatively, a threshold can be identified from the histograms or box plots of the input features. All transactions with input features beyond that threshold are declared fraud candidates (discriminative approach). This approach usually requires a number of examples of both fraudulent and legitimate transactions to build the histograms or box plots.
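A threshold of this kind can be derived in several ways. The sketch below assumes the usual box-plot whisker rule (Tukey’s fences, 1.5 times the interquartile range beyond the quartiles); the toy feature values and function names are hypothetical, and in practice the cut-offs would be chosen by inspecting the plots for fraudulent and legitimate examples.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box-plot whisker limits for one feature."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_fraud_candidates(X, fences):
    """Flag rows where any feature falls outside its fences."""
    lower = np.array([lo for lo, _ in fences])
    upper = np.array([hi for _, hi in fences])
    return ((X < lower) | (X > upper)).any(axis=1)

# Toy training data (assumption): columns are amount and items per order.
X_train = np.array([
    [12.0, 1.0], [15.0, 2.0], [14.0, 1.0], [13.0, 3.0],
    [16.0, 2.0], [11.0, 2.0], [15.0, 1.0], [14.0, 3.0],
])

# One pair of fences per input feature.
fences = [tukey_fences(X_train[:, j]) for j in range(X_train.shape[1])]

# The second transaction has an amount far beyond the upper fence.
X_new = np.array([[14.0, 2.0], [950.0, 2.0]])
flags = flag_fraud_candidates(X_new, fences)
print(flags)
```

Because each feature is checked on its own, this rule misses fraud that only shows up in unusual combinations of otherwise normal feature values.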