Machine Learning 101 : Why, What and How?

In this explainer, we focus on a high-level view of Machine Learning as a field and set of technologies. We explore why there has been an explosion in Machine Learning, how we already use machine learning in our everyday lives, and how it will help us push the frontiers in engineering and science. We then explore:

What machine learning is, and how there are different classes of learning based on the task required.
The key idea of generalisation as our understanding of "intelligence".
The considerations we need to weigh when choosing which algorithms to use in our machine learning workflow: big data, sparse data, a large number of inputs or outputs, and the amount of trust needed in the answer.

Finally, I present a few personal thoughts on how machine learning should be learnt.

This lecture is intended to be a presentation, with me as a talking head! Below are a set of summary notes for reference.

Why Machine Learning?

The world has been witnessing an explosion in data generation, with the amount of data increasing 50-fold between 2010 and 2020, and the total amount estimated to be around 44 ZB today. This phenomenon has been driven by several factors, including the availability of cheap data storage, the power of modern computing devices, and the proliferation of data-generating devices, ranging from sensors to remote systems. In this context, Machine Learning (ML) has emerged as a crucial field for extracting value, insights, and intelligence from this vast pool of information. By finding patterns in data, ML can enable us to analyze and understand complex phenomena, such as user behaviour, consumer preferences, market trends, disease outbreaks, and climate change. Therefore, the growing need for ML is aligned with the need for businesses, governments, and individuals to make sense of this exponential growth of data, and leverage it for innovation, productivity, and social impact.

You use Machine Learning every day

Machine Learning isn't a new thing. It has been well integrated into many of the tools we use every day. Here are some examples:

Search Engines (e.g. Google)
Recommender Systems (e.g. Netflix & Amazon)
Speech Understanding (e.g. Siri, Alexa)
Face Recognition
Large Language Models (e.g. ChatGPT)
Self Driving Cars

These are all really cool examples of ML. At digiLab, we are also excited by the opportunities ML offers at the frontiers of Engineering & Science, from fusion energy to automated agriculture to medical diagnosis. There are very few areas not being touched by the advances in machine learning.

What is Machine Learning?

When we talk about Machine Learning, it is important to understand that it is not just a single algorithm or workflow. Rather, it refers to a set of technologies that enable machines to learn from data and subsequently perform specific tasks. Machine Learning workflows essentially work as a pipeline that takes in data, trains on it, and then deploys the trained model for specific purposes.

A central concept in machine learning is what we mean by "intelligence" in a machine learning algorithm. For me, the "intelligence" we assign is the ability of a Machine Learning algorithm or workflow to perform well on previously unseen tasks. This core concept is called Generalisation.

If an algorithm is just reproducing information it has been given, it is really nothing more than a rich look-up table. The fact that patterns, correlations and connections can be made and then used to bring insight into new cases is central to our understanding of intelligence as humans.
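To make the contrast concrete, here is a minimal sketch, assuming a made-up toy dataset that follows y = 2x + 1. The names `lookup`, `fit_line` and `predict` are illustrative only. The look-up table has nothing to say about an input it has never seen, whereas a fitted line, which has captured the pattern, still gives an answer.

```python
# Toy data (hypothetical): the underlying rule is y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

# A look-up table can only reproduce what it has been given.
lookup = dict(zip(train_x, train_y))

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train_x, train_y)

def predict(x):
    """Use the learnt pattern, so inputs outside the training set are fine."""
    return slope * x + intercept

unseen_x = 10.0
print(lookup.get(unseen_x))  # None: the table has no entry for an unseen input
print(predict(unseen_x))     # 21.0: the fitted line generalises to the new input
```

Real data is noisy and rarely this tidy, so in practice generalisation is judged by holding back some data and checking predictions against it.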

Supervised, Unsupervised & Reinforcement Learning

There are different approaches for different machine learning tasks.

They are often not used in isolation, so a single machine learning workflow may bring together both unsupervised and supervised learning algorithms. This is really common.

For example, we might first find the patterns in the data to simplify the data representation (unsupervised learning), and then, from this slimline representation, fit a model which makes good predictions from labelled examples (supervised learning).

So what are the different types of learning?

Supervised machine learning is a type of learning where an algorithm is trained using a labelled dataset. In other words, the input data is already tagged with the correct output, allowing the algorithm to learn the mapping between the input and output variables. For instance, if we want to build an algorithm to predict the price of a house based on its size and location, we would provide a dataset containing examples of houses, their sizes, locations, and their corresponding prices. During the training process, the algorithm will learn the relationship between these features and the price and use this knowledge to make predictions on new, unseen data.
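As a rough illustration of the house price example, the sketch below uses k-nearest-neighbour regression, which is just one simple supervised method chosen for brevity. The dataset, the feature choice (size and distance to the city centre as a stand-in for location) and the scaling ranges are all invented for illustration.

```python
import math

# Hypothetical labelled data: (size in m^2, distance to centre in km) -> price in thousands.
houses = [
    ((50.0, 1.0), 300.0),
    ((80.0, 2.0), 400.0),
    ((120.0, 5.0), 450.0),
    ((60.0, 10.0), 200.0),
    ((150.0, 12.0), 380.0),
]

def scale(features):
    """Put both features on a comparable scale (assumed ranges: 200 m^2 and 20 km)."""
    size, distance = features
    return (size / 200.0, distance / 20.0)

def predict_price(features, k=2):
    """Average the prices of the k labelled houses closest to the query house."""
    query = scale(features)
    by_closeness = sorted(houses, key=lambda item: math.dist(scale(item[0]), query))
    nearest = by_closeness[:k]
    return sum(price for _, price in nearest) / k

# A new, unseen house: 75 m^2, 2.5 km from the centre.
print(predict_price((75.0, 2.5)))  # 350.0, the mean of its two closest neighbours
```

The labels (prices) are what make this supervised: without them there would be nothing to average.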

On the other hand, unsupervised machine learning is used when the input data is not labelled or categorized. Instead, the algorithm is tasked with finding patterns or structures in the data on its own. This is particularly useful in cases where there is no clear objective or target variable. An example of unsupervised learning is clustering, where the algorithm groups data points together based on their similarity. Another example is dimensionality reduction, where the algorithm reduces the number of input features while preserving the most important ones. Unsupervised learning is often used in exploratory data analysis or in cases where the data is too complex to be easily labelled.
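Clustering can be sketched in a few lines. The example below is a bare-bones k-means in one dimension, assuming hypothetical unlabelled measurements and hand-picked starting centres. It is meant to show the idea of grouping by similarity, not to be a production implementation.

```python
def kmeans_1d(values, centres, iterations=10):
    """Plain k-means in one dimension, starting from the given centres."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster.
        centres = [
            sum(members) / len(members) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
    return centres, clusters

# Hypothetical unlabelled measurements that form two visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centres, clusters = kmeans_1d(values, centres=[0.0, 6.0])
print(centres)   # one centre near 1.0, the other near 5.0
print(clusters)  # the two groups, found without any labels
```

Note that no target variable appears anywhere: the algorithm only ever sees the inputs.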

Lastly, reinforcement learning is a type of learning that relies on an agent interacting with an environment and receiving feedback in the form of rewards or punishments. The agent learns by trial and error, and its goal is to find a policy that maximizes the cumulative reward over time. Reinforcement learning has been successfully used in areas such as game playing, robotics, and autonomous driving.
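The trial-and-error loop can be sketched with a multi-armed bandit, which is about the simplest reinforcement learning setting (one state, several actions). The epsilon-greedy rule used here is a standard textbook strategy, and the reward probabilities, step count and seed are arbitrary assumptions.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))                  # explore
        else:
            action = max(range(len(values)), key=values.__getitem__)   # exploit
        # The environment responds with a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental mean: nudge the estimate towards the observed reward.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward

# Hypothetical environment: three actions with hidden success rates.
values, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
print(counts)  # the agent ends up choosing the most rewarding action most often
```

The agent is never told which action is best. It discovers this from the rewards alone, which is the defining feature of reinforcement learning.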

Different Amounts and Types of Data

A central consideration in what methods we deploy in our ML workflow is how much data we have. Companies like Amazon, for example, capture a huge amount of data about how customers interact with their services. Their machine learning challenges are really about distilling patterns in these huge data sets, and because the data is so rich, much less structure needs to be imposed on the methods they use. This leaves them free to use very flexible approaches, which are then sufficiently constrained by the data.

In other applications, data can be very expensive to obtain. A good example that digiLab knows well is the output of mathematical simulations for Fusion, where a single simulation can take as long as a week on a national supercomputer (amazing). In this case, because data is limited, the machine learning methods chosen need to be robust to limited data.

The Curse of Dimensionality

In simple terms, the Curse of Dimensionality refers to the fact that as the number of features (or dimensions) in a dataset increases, the amount of data required to cover the feature space increases exponentially. This means that as the dimensions increase (models with large numbers of inputs and parameters) it becomes increasingly difficult to find patterns in the data. In other words, most points in a high-dimensional space are very far apart from each other - much further than we would expect in lower-dimensional spaces.
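A small experiment can make the "far apart" claim tangible. The sketch below scatters a fixed number of random points in the unit cube and measures the average distance from each point to its nearest neighbour. The point count, seed and choice of dimensions are arbitrary assumptions.

```python
import math
import random

def mean_nearest_neighbour_distance(dimensions, n_points=100, seed=0):
    """Average distance from each random point to its closest neighbour."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dimensions)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

# Same number of points each time; only the dimension changes.
for d in (1, 2, 10, 100):
    print(d, round(mean_nearest_neighbour_distance(d), 3))
```

With the amount of data held fixed, the nearest neighbour of a typical point drifts further and further away as the dimension grows, so there is less and less local information to learn from.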

Suppose we have a one-dimensional input space, which takes values between 0 and 1. We ask for three equally spaced data points so that the distance between each data point is just 0.2. In two dimensions, the number of data points required to keep the same distance apart is 3^2 = 9 points.

In much higher dimensions, say 100 (not a large problem by ML standards), the number of samples needed to retain a distance of 0.2 between each point is 3^100 ≈ 5.15e47. This is an eye-watering amount of training data, even though the dimension of the input space in this example is still relatively small.
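The arithmetic is easy to check. The helper `grid_points` below is a hypothetical name for the count of points on a regular grid with a fixed number of points along each axis.

```python
def grid_points(points_per_axis, dimensions):
    """Number of points on a regular grid: one factor per dimension."""
    return points_per_axis ** dimensions

# Three points per axis, as in the example above.
for d in (1, 2, 3, 10, 100):
    print(d, f"{grid_points(3, d):.2e}")
```

The last line of output reads `100 5.15e+47`, matching the figure quoted in the text.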

Let's think about a common machine learning problem: image classification. The question "Is this a picture of a cat?" is a classic supervised (classification) problem. Take, for example, a picture made up of 640 by 480 pixels, in which each pixel is represented by 3 colour values: Red, Green, and Blue (RGB).

This gives a total input dimension of 640×480×3=921,600 values.
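To put this number in context, the short sketch below recomputes the input dimension and then asks how many grid points a "three values per input" grid would need. The result is far too large to print directly, so only its order of magnitude is reported. The variable names are illustrative.

```python
import math

width, height, channels = 640, 480, 3
input_dimension = width * height * channels
print(input_dimension)  # 921600

# 3 ** input_dimension is too large to print usefully, so work with its logarithm.
log10_grid_points = input_dimension * math.log10(3)
print(f"3^{input_dimension} is roughly 10^{log10_grid_points:.0f}")
```

Covering such a space with data is out of the question, which is why reducing the dimension matters so much.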

A central consideration in machine learning is how we can reduce the dimension of this input space, but still retain the maximum amount of information we need about the system.
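One common way to do this is principal component analysis (PCA), named here as an example technique since the text does not prescribe one. The sketch below handles only the two-dimensional case, where the direction of greatest variance has a closed form, and uses made-up data in which the two features are strongly correlated.

```python
import math

# Hypothetical 2-D data whose two features are strongly correlated (roughly y = 2x).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Entries of the 2x2 covariance matrix [[a, b], [b, c]].
a = sum((x - mean_x) ** 2 for x, _ in points) / n
c = sum((y - mean_y) ** 2 for _, y in points) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

# Direction of greatest variance (first principal component) in the 2x2 case.
theta = 0.5 * math.atan2(2 * b, a - c)
direction = (math.cos(theta), math.sin(theta))

# One number per point instead of two: the coordinate along that direction.
reduced = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1] for x, y in points]

# Fraction of the total variance retained by the 1-D representation.
kept = (sum(r ** 2 for r in reduced) / n) / (a + c)
print(direction)
print(round(kept, 4))
```

Here the input dimension is halved while almost all of the variance is retained, because the second feature carried very little information that the first did not.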

Do the Outputs Matter?

An important consideration in machine learning is whether the outputs of a model matter. This is central when thinking about "how good does a model have to be?" and influences our decisions. For instance, if a recommender system suggests an unsuitable item for purchase, it may not matter much, as the algorithm can learn and recommend a better item the next time.

However, in high-stakes applications (e.g. water quality control, cancer diagnosis, or drone control) the outputs can have serious consequences. In such cases, it is imperative for the application to provide not only predictions but also a measure of uncertainty or risk associated with the predictions. This can aid decision-makers in making informed and responsible choices by weighing the potential risks against the benefits.
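There are many ways to attach uncertainty to a prediction. The sketch below uses a bootstrap ensemble of straight-line fits, chosen only because it is short. The data is synthetic, and the names `fit_line` and `bootstrap_predictions` are illustrative. Each model is fitted to a resampled copy of the data, and the spread of their predictions serves as a rough uncertainty measure.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; None if all x are equal."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_predictions(xs, ys, query, n_models=200, seed=0):
    """Predictions at `query` from lines fitted to resampled copies of the data."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        picks = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        fitted = fit_line([xs[i] for i in picks], [ys[i] for i in picks])
        if fitted is not None:
            slope, intercept = fitted
            predictions.append(slope * query + intercept)
    return predictions

# Hypothetical noisy measurements of y = 2x + 1 on the interval [0, 5.5].
data_rng = random.Random(1)
xs = [0.5 * i for i in range(12)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.5) for x in xs]

for query in (2.5, 20.0):
    preds = bootstrap_predictions(xs, ys, query)
    print(query, round(statistics.mean(preds), 2), "+/-", round(statistics.stdev(preds), 2))
```

The spread is small for a query inside the range of the data and much larger for one far outside it, which is exactly the kind of warning a decision-maker needs before trusting a prediction.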

If uncertainty measures are needed, this plays a key role in choosing which algorithms are appropriate.

Thoughts on Learning ML

When it comes to learning Machine Learning, there are a few things to keep in mind. First, it's important not to take a bottom-up approach, as this can lead to a lack of direction and understanding of the big picture. Instead, diving into learning on a specific problem allows for more effective learning, as each problem will require different types of machine learning algorithms and approaches.

Second, focus on understanding the big concepts behind each approach before diving into the details. This is more effective because the details can often become overwhelming, whereas understanding the larger concepts provides a framework for tackling them.

To learn Machine Learning, it's important to play around with the math and programming concepts by experimenting with different algorithms and approaches. This hands-on approach is a great way to learn through experience and to see how different techniques affect the performance of models.

It's also important to keep in mind that Machine Learning is a fast-moving and ever-growing field, so staying up to date with the latest developments and techniques is crucial for success.

In conclusion, learning Machine Learning is an ongoing process that requires an open mind, a willingness to learn, and a dedication to staying up to date with the latest developments in the field. By keeping these tips in mind, anyone can become an expert in Machine Learning and apply it to a variety of real-world problems.
