
Machine Learning:
The Fundamentals

Xu Han
Director,
S&P Global Ratings
xu.han@spglobal.com
Sudeep Kesh
Chief Innovation Officer,
S&P Global Ratings
sudeep.kesh@spglobal.com



Disclaimer: This article is part of a series covering the fundamentals of AI technologies. Our objective is to provide independent research insights on what these technologies are and how they work. We also provide context regarding their applications and highlight key risks and considerations.

This article is written and published by S&P Global, as a collaborative effort among analysts from different S&P Global divisions. It has no bearing on credit ratings.

Published: November 29, 2023

Highlights

Machine learning is a subset of artificial intelligence that focuses on the development of statistical algorithms that can perform tasks without explicit instructions. It is responsible for the suite of solutions collectively known as “predictive analytics.”

Machine learning forms the backbone of different types of artificial intelligence. Classifiers and neural networks provide much of the technological firepower that many artificial intelligence (AI) use cases require, ranging from text and image processing to recommendation engines, information extraction, computer vision and natural language processing.

Advancements in machine learning will likely coincide with and even drive the transformative and generative abilities of AI in a new era of productivity and industrial and labor transformation.


Machine learning is a subset of artificial intelligence that focuses on the development of statistical algorithms that can perform tasks without explicit instructions. It’s the basis of a variety of applications in AI, ranging from large language models to computer vision, speech recognition, spatial recognition and more. Machine learning has a rich history — the term was coined more than 60 years ago, and the field is responsible for the suite of solutions collectively known as “predictive analytics.” In this primer, we describe some common forms of machine learning, give a brief account of its history, and frame various applications, use cases, and risks. We also take a brief look at “deep learning,” a branch of machine learning that supports and augments its capabilities.

AI is generally broken down into two types — discriminative AI and generative AI. Machine learning and deep learning form the backbone of both types. Discriminative AI involves algorithms that analyze historical data and find patterns. Other algorithms can then use these patterns to predict behaviors. Similarly, patterns observed in discriminative AI can be used in generative AI, where algorithms could create new content, be it text, visual media or audio media, among others. These use cases have recently exploded in popularity with tools like ChatGPT and Dall-E.

What machine learning is — and what it’s not

Computer pioneer Arthur Samuel defined machine learning as “a field of study that gives computers the ability to learn without being explicitly programmed.” As mentioned above, machine learning is a subset of AI. Many AI systems are based on machine learning techniques, which include supervised, unsupervised, and reinforcement learning models (see Figure 1).

Figure 1: Categories of machine learning models

Supervised learning
Description: Uses machine learning techniques to analyze labeled data sets, training algorithms to properly classify data and predict outcomes.
Examples of algorithms: Regression models, decision trees, random forests, support vector machines (SVM), k-nearest neighbors (KNN), etc.

Unsupervised learning
Description: Uses machine learning algorithms to analyze and cluster unlabeled data sets.
Examples of algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA), etc.

Reinforcement learning
Description: Uses machine learning training methods based on rewarding desired behaviors and punishing undesired ones.
Examples of applications: Gaming (e.g., AlphaGo), traffic control, chatbots, autonomous driving, robotics, etc.


Source: S&P Global.


All about classifiers

While AI does not strictly require machine learning, machine learning algorithms are typically at the heart of AI operations. Some of the best-known operations are collectively known as classifiers. These core algorithms fall into one of four categories, depending on the type of problem the user aims to solve:

Classification algorithms look for patterns in data with filters, either preset or self-generating, which allow the algorithm to place items in groups. For example, email filtering applications use classifiers to differentiate spam messages from legitimate communications. More complex is sentiment analysis, which assesses written text (or speech converted to text) to determine emotional tone (e.g., positive, negative or neutral) and classify text into corresponding groups. Similarly, visual patterns can be assessed in photographs. For example, a classification algorithm could be trained to evaluate photos of dogs to determine their breed. Similar techniques are used for facial recognition. Classification is typically a type of supervised learning algorithm, where the machine learns from labeled data to produce outputs. We’ll discuss the differences between supervised and unsupervised algorithms in greater depth later in this report.
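For illustration, the following is a minimal sketch of a supervised classifier in the spirit of the spam-filtering example above, written in Python with the scikit-learn library. The toy messages and labels are invented for the example; a production filter would be trained on a large labeled corpus.

```python
# Minimal sketch of a supervised text classifier in the spirit of the
# spam-filtering example above. The tiny message set and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Win a free prize now",                # spam
    "Limited offer, click here",           # spam
    "Meeting moved to 3pm",                # legitimate
    "Please review the attached report",   # legitimate
]
labels = ["spam", "spam", "ham", "ham"]

# Pipeline: turn raw text into word counts, then fit a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["Claim your free offer"]))   # likely ['spam']
print(model.predict(["See you at the meeting"]))  # likely ['ham']
```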

Clustering is similar to classification in that it separates data points with certain properties. However, clustering algorithms are generally unsupervised (the machine learns to map outputs from inputs that don’t have labels, by looking for hidden patterns), creating “clusters” of data points with common properties. Applications include differentiating between fraudulent and legitimate transactions, identifying “fake news” versus true news items, and performing customer segmentation (e.g., grouping frequent purchasers of a certain good or service versus infrequent or non-purchasers).
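As a simplified illustration of unsupervised clustering, the sketch below (Python with scikit-learn) groups synthetic “customers” into two segments using k-means, without any labels being provided. The two features and the data are invented for the example.

```python
# Minimal sketch of unsupervised clustering, loosely following the customer
# segmentation example above. The two-feature data set is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two invented features per customer: purchases per month, average basket size
frequent = rng.normal(loc=[12.0, 80.0], scale=[2.0, 10.0], size=(50, 2))
occasional = rng.normal(loc=[2.0, 25.0], scale=[1.0, 5.0], size=(50, 2))
customers = np.vstack([frequent, occasional])

# No labels are provided; k-means looks for two groups on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.cluster_centers_)   # one centroid per discovered segment
print(kmeans.labels_[:5])        # cluster assignment for the first few customers
```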

Dimensionality reduction is similar to clustering in that it is also an unsupervised learning technique, though it is generally motivated by computational and performance considerations rather than by revealing the structure of the data. Dimensionality reduction involves reducing the number of variables or groups to those that are necessary for a particular problem. Using algorithms of this type is crucial for conserving computational resources and for preventing “overfitting,” a condition in which analysis hews too closely to historical patterns and is thus insufficiently actionable or predictive for modeling future outcomes. Examples include simplifying models used in healthcare applications, which notoriously involve vast numbers of variables (e.g., blood pressure, weight, cholesterol levels, vitamin levels, etc.), or, in law enforcement and counterterrorism operations, reducing false positives that would otherwise derail productive work.
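A minimal sketch of dimensionality reduction with principal component analysis (PCA) follows, using Python and scikit-learn. The 20-variable “patient” data set is synthetic and merely stands in for the kind of high-dimensional health data mentioned above.

```python
# Minimal sketch of dimensionality reduction with PCA on synthetic data that
# has a few underlying drivers hidden among many observed variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                 # a few hidden factors
mixing = rng.normal(size=(3, 20))                  # spread across 20 variables
patients = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

# Keep only as many components as needed to explain 90% of the variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(patients)

print(patients.shape, "->", reduced.shape)         # far fewer columns remain
print(pca.explained_variance_ratio_.round(3))
```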

Regression/prediction is generally considered a supervised learning technique, best used to predict continuous values of data. The main difference between regression and classification algorithms is that with regression, the output variable is numerical, whereas with classification, the output variable is a category (such as “yes” or “no”). Common uses include predicting the price of a house based on certain features, the likelihood of college admission based on test scores, sales forecasting, weather forecasting, etc. The main metrics for evaluating a regression model are variance, bias, and error, which speak to the model’s anticipated performance quality, given an uncertainty of data outside of its training set (historical data).
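To make the regression case concrete, here is a minimal sketch in Python with scikit-learn that fits a linear model to synthetic house-price data and reports a simple error metric. The feature, prices, and coefficients are invented for the example.

```python
# Minimal sketch of a regression model, echoing the house-price example above.
# The single feature (floor area) and the prices are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
area = rng.uniform(50, 250, size=(100, 1))                    # square meters
price = 3000 * area[:, 0] + 50_000 + rng.normal(0, 20_000, size=100)

model = LinearRegression().fit(area, price)
predictions = model.predict(area)

print("learned slope:", round(model.coef_[0]))                # close to 3000
print("mean squared error:", round(mean_squared_error(price, predictions)))
print("predicted price for 120 m^2:", round(model.predict([[120]])[0]))
```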


A note on risks

Across this series of articles, we aim to facilitate a conversation about risks — risks born of the technologies powering AI, as well as risks born of the use of such technologies in an industrial application. By nature, machine learning has the ability to learn from patterns in data and make decisions based on those patterns to complete assigned tasks.

We’ve explored the concepts of “supervised learning” and “unsupervised learning,” which refer more to target activity and resource allocation than to risk management. Supervised learning, for example, is generally used to classify data or make predictions, whereas unsupervised learning is used to help understand the relationships within data to facilitate predictions (or for quantitative research purposes). Supervised learning is more resource-intensive because it requires data to be classified or “labeled.” Labeling can be performed by another algorithm or by a person, but performance is usually best with a “human in the loop.” While algorithms have advanced dramatically in their performance and abilities, guarding against poor decision-making requires careful thought. Similarly, human decision-making, particularly “gut checking,” is complex and difficult to model reliably.

One underlying risk to manage in these instances is bias. Bias is the result of a model being trained on data that skews decision-making toward one particular outcome. A particularly problematic example would be using machine learning to make decisions on credit applications based purely on historical observations of prior decisions. Such an approach would run the risk of reinforcing past issues with unfair credit decisions by learning from and perpetuating historically enacted bias. On the flip side, machine learning could also be used to help identify biases, as well as discriminatory and unfair practices, which may in turn yield advancements in ensuring fairness and equity in future decision-making.

Another common and important category of risk to manage in machine learning and deep learning is overfitting and underfitting. Overfitting happens when a model learns its training data too closely, including its noise and idiosyncrasies, and therefore fails to generalize when making predictions about new situations. Conversely, underfitting happens when a model is too simple, or is trained on too little data, to capture the underlying patterns at all. The disclaimer used across the finance industry that “past performance is no guarantee of future results” captures the spirit of these issues, but the challenges can be managed. Techniques to avoid overfitting and underfitting include choosing high-quality data, using cross-validation, applying regularization, and tuning hyperparameters.
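As a simplified illustration of two of the techniques named above, the sketch below (Python with scikit-learn) compares an unregularized, deliberately over-flexible polynomial model with a regularized (ridge) version of the same model, scoring both with five-fold cross-validation on synthetic data. The data, polynomial degree, and penalty strength are chosen purely for illustration.

```python
# Minimal sketch of two mitigation techniques: cross-validation and
# regularization. The noisy quadratic data set is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * x[:, 0] ** 2 + rng.normal(0, 1.0, size=60)   # quadratic signal + noise

# Degree-15 features invite overfitting; the ridge penalty shrinks the
# coefficients the model does not really need.
overfit_prone = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False), StandardScaler(), LinearRegression()
)
regularized = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False), StandardScaler(), Ridge(alpha=1.0)
)

for name, model in [("unregularized", overfit_prone), ("ridge", regularized)]:
    # 5-fold cross-validation scores each model on data it was not trained on
    scores = cross_val_score(model, x, y, cv=5, scoring="r2")
    print(name, scores.mean().round(3))
```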

As this research series expands, we will address specific risk-related issues that emanate from the application of these technologies in specific contexts. We believe this is a clear and actionable way to provide insights on such risks, bound by a cohesive frame around the specific risk to be managed.

Machine learning algorithms are considered the “core” of artificial intelligence applications. They serve a foundational function, and their importance in many downstream processes requires care to ensure properly balanced “risk managed” outcomes. Machine learning’s rich and storied history also entails a balance: huge leaps in technical advancement and promise set against periods of social doubt and relative drought in research funding.


A brief history of AI, from the Turing test to today’s transformers

Discussions of AI history frequently mention the “Turing test,” a thought experiment conceived by (and named after) British mathematician and computer scientist Alan Turing. The test considers one way to determine whether a machine could exhibit human-like intelligence. Specifically, it contemplates the capability of a machine to generate language-based responses that are indistinguishable from those of a human. In the scenario Turing envisioned, scientists would ask a human to have a typed conversation without knowing whether they were communicating with another person or a machine. If the person were to believe they were talking to a human when in fact they were conversing with a machine, the machine would pass the Turing test. 

The modern concept of AI is deeply rooted in Turing’s operational definition of machine intelligence — determining success based on how convincingly a machine can replicate human-like results in performing a task, rather than attempting to directly answer the question of whether a machine can “think.” Fast-forward 70-plus years, and this framework gives us an enhanced understanding of the excitement that arose when GPT-3.5 (aka ChatGPT) was released for public trial. Using a unique application of machine learning foundations and advancements in modeling, the technology produces results that, by the parameters of the Turing test, are quite convincing.

In that 70-year span between the Turing era and today, several major phases of machine learning development led to the current precipice of AI ubiquity. 

Many of the foundations of machine learning were established in the 1950s and 1960s. With the realization that handcrafting every computing rule would become unsustainable, computer science shifted toward teaching computers to “learn” from data. At that time, the concept of an artificial “neuron” was introduced, and early perceptron algorithms were developed. The perceptron, put simply, is a simple network in which each input is connected to an output through a weight that represents the strength of that connection. Changing the weights changes the output.
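For illustration, the following is a minimal sketch of the classic perceptron learning rule in Python with NumPy: weights are nudged whenever the prediction disagrees with the target. The toy task (learning the logical OR of two inputs) and the learning rate are chosen purely for the example.

```python
# Minimal sketch of a single perceptron with the classic update rule.
import numpy as np

# Toy, linearly separable data: learn the logical OR of two inputs
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 1])

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1

for _ in range(10):                           # a few passes over the data
    for x, target in zip(inputs, targets):
        prediction = 1 if weights @ x + bias > 0 else 0
        error = target - prediction           # 0 if correct, +/-1 if not
        weights += learning_rate * error * x  # strengthen or weaken connections
        bias += learning_rate * error

print(weights, bias)
print([1 if weights @ x + bias > 0 else 0 for x in inputs])  # [0, 1, 1, 1]
```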

Despite such advances in computational technology, funding and interest dropped for a time in the 1970s and 1980s, beginning a period sometimes referred to as the “AI winter.” Nevertheless, development continued with increasingly powerful algorithms, including two now-famous classes of neural networks developed in the 1980s: networks trained with backpropagation and recurrent neural networks. Neural networks are algorithms that teach computers to process data in a way that loosely mimics neural processes in the human brain. In the 1980s, when these concepts were developed, the work was largely theoretical, as it was difficult to obtain enough data and amass adequate computing resources to apply them in the field.

Roughly 15 years later, the digital era sowed the seeds of further advancements in machine learning models, making them both more viable and increasingly essential. In this phase, powerful new algorithms emerged, including ensemble methods, which combine multiple machine learning models, as exemplified by “random forests,” as well as “support-vector networks.”

In the 2000s and 2010s, with the advent of “big data” and continued advancements in computational power, significant breakthroughs occurred in machine learning and AI, especially in neural networks. A new family of machine learning methods, known as “deep learning” because the architectures were deeper and more complex, containing several hidden layers between the input and the output, enabled data processing at a deeper, more accurate, and more flexible level. Benchmarks that had been frozen for decades improved dramatically across almost all the classic applications, such as machine translation in natural language processing and image classification in computer vision.

Most recently, the transformer architecture (an encoder-decoder model) has given rise to a growing list of “killer apps” since its introduction in 2017 in a paper by Google researchers titled "Attention Is All You Need." The transformer has become the foundation for many subsequent models in natural language processing, such as BERT, T5, GPT, and more.

Relationships between AI, machine learning, and deep learning

People are often confused about the definitions of AI, machine learning, and deep learning. This is understandable, as the concepts are quite complex.

To put it succinctly (and at the expense of some degree of precision), one can think of artificial intelligence as a broad domain that consists of an amalgamation of different systems, collectively designed to perform tasks typically associated with human intelligence (e.g., visual perception, speech recognition, decision-making, and language translation). Machine learning is a subset of AI in which machines use data to learn and make decisions to facilitate those higher-level tasks that fall under the AI umbrella. Deep learning is a further subset of machine learning, characterized by many layers and substantial complexity. Of all current technologies, deep learning bears the closest resemblance to the structure and operation of a human brain. The relationship between the three concepts is illustrated in the diagram below.


Types of neural networks

Having covered AI and machine learning, we now turn to deep learning and its various types of neural networks, whose architectures and performance characteristics offer distinct advantages for specific use cases. The three types we will explore are recurrent neural networks, convolutional neural networks, and transformers.

Recurrent neural networks (RNNs) are neural networks that can “remember” past events and make predictions about the future. RNNs analyze patterns in data (input), log sequences (hidden layer), and make logical predictions using those patterns and sequences in tandem (output). RNNs are commonly used in tasks such as language translation, handwriting and typing recognition, and recommendation engines for e-commerce, music, and video libraries.

RNNs process data sequentially and were designed to extract as much semantic information as possible from sequential context. When making predictions, an RNN draws on the context accumulated from all of its prior inputs. The more steps a decision requires, however, the harder it is for a recurrent network to learn how to make that decision. The sequential design also makes it difficult for RNNs to fully take advantage of modern fast-computing devices, such as tensor processing units (TPUs) and graphics processing units (GPUs), which excel at parallel rather than sequential processing. These architectural limitations make it too complex and too expensive to train RNNs on very large volumes of text or language data.
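The sketch below (Python with NumPy) shows the recurrence at the heart of an RNN: a hidden state updated one input at a time, carrying context forward. The dimensions and random weights are arbitrary, and no training is shown; the point is the step-by-step dependency that makes parallelization difficult.

```python
# Minimal sketch of the RNN recurrence: a hidden state carries context forward.
import numpy as np

rng = np.random.default_rng(3)
input_size, hidden_size, seq_len = 8, 16, 5

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))   # e.g., 5 word embeddings
hidden = np.zeros(hidden_size)

for x_t in sequence:
    # Each step mixes the current input with the "memory" of everything so far.
    # This sequential dependency is why RNNs are hard to parallelize.
    hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)

print(hidden.shape)   # (16,) summary of the whole sequence
```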

Convolutional neural networks (CNNs) are a bit more complex than RNNs. They are used commonly in image processing to recognize and differentiate people and objects within images. We use a common image processing problem to illustrate the steps the CNN undertakes. First, the CNN looks at several pictures, focusing on patterns observed in each picture. The CNN breaks down the patterns in different dimensions (shapes, colors, and other observable patterns) independently, catalogs them, and puts them together using a technique called “pooling.” The network then flattens the different dimensions and makes decisions using its assessed patterns to classify objects.

Put differently, CNNs split the context of an image into small segments and extract the features from such spatial data, rather than sequentially processing data as RNNs do. The number of steps required to combine information from distant segments of the input still grows with increasing distance. CNNs are efficient and effective for spatial data, such as image and video processing, but they lack the capability to preserve information from sequential data, such as text and spoken language.
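For illustration, the following minimal sketch in Python with NumPy applies the two CNN building blocks described above to a tiny synthetic “image”: a 3x3 filter slid over the image (convolution), followed by 2x2 max pooling. The image and filter values are invented; a real CNN learns its filters from data.

```python
# Minimal sketch of convolution followed by max pooling on a tiny "image".
import numpy as np

rng = np.random.default_rng(5)
image = rng.normal(size=(8, 8))          # tiny single-channel image
kernel = np.array([[1.0, 0.0, -1.0],     # a 3x3 vertical-edge-style filter
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

# Convolution: take the filter's dot product with every 3x3 patch
feature_map = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        feature_map[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

# Max pooling: keep the strongest response in each 2x2 block
pooled = feature_map.reshape(3, 2, 3, 2).max(axis=(1, 3))

print(feature_map.shape, "->", pooled.shape)   # (6, 6) -> (3, 3)
```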

The transformer, one of the newest types of neural networks, works similarly to the prior two categories in many ways, with the key difference that transformers do not use sequential processing or convolutional filters. Instead, the transformer assesses multiple dimensions at once using a concept called “attention” to drive its “understanding of context.” The popular ChatGPT application is based on a transformer architecture (“GPT” stands for generative pre-trained transformer). A typical transformer architecture uses an arrangement of encoders and decoders so that information is simultaneously passed between each encoder/decoder pair in a highly parallel fashion. For this reason, GPUs can be used more efficiently to increase the speed of calculations and training. This explains the rapid rise of GPU manufacturers, such as Nvidia.

Other characteristic features include positional encoding and self-attention. Positional encoding refers to the indexing of words, images, or other data in segments by their original location as they pass through the network. Self-attention refers to the ability of the network to identify the most important features in the data and their relationships to other data. These performant networks allow for relatively fast training on a giant corpus of data, such as the contents of the internet itself.

Transformer networks are different from traditional RNNs and CNNs in several ways: the encoder-decoder structure, use of attention mechanisms (e.g., self-attention and multi-head attention), and parallel processing to handle input sequences via positional embedding (accounting for the context of input and its position). Similar to human beings, transformers focus on “important” information in the input context via attention mechanisms. A transformer can learn to immediately attend to a word it is looking for and make a decision in a single step, facilitating simultaneous exchanges between its encoder and decoder structures and yielding remarkable efficiency. Since the transformer addresses key drawbacks of both CNNs and RNNs, it is now the dominant approach in language-understanding tasks.
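A minimal sketch of scaled dot-product self-attention, the core transformer mechanism described above, follows in Python with NumPy. The sequence length, embedding size, and random projection matrices are placeholders; in a real transformer these projections are learned, and the computation is repeated across multiple heads and layers.

```python
# Minimal sketch of scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(11)
seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings
tokens = rng.normal(size=(seq_len, d_model))

# Learned projections (random here) map each token to a query, key, and value
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Attention weights: how much each token should "look at" every other token
scores = Q @ K.T / np.sqrt(d_model)          # (4, 4) similarity matrix
weights = softmax(scores, axis=-1)           # each row sums to 1
output = weights @ V                         # context-aware token representations

print(weights.round(2))                      # the attention pattern
print(output.shape)                          # (4, 8)
```

Because the score matrix relates all token pairs at once, every row can be computed in parallel, which is part of why GPUs and TPUs are such a natural fit for transformers.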


Applications for machine learning

While machine learning concepts and techniques have been in the process of ideation and development for nearly a century, the ability to traverse the theoretical and arrive at practical applications has largely been contingent upon advancements in computing power and the availability of training data. Today, machine learning can be used to process information obtained from computational data as well as telemetric sensor data from the internet of things (IoT), including instrumentation embedded in machinery and even medical devices.

Using machine learning processes to quickly harvest data, identify patterns, estimate probabilities, and generate forecasts allows for analytical applications across a broad range of industries.

In finance, algorithmic trading and fraud detection are popular examples of how machine learning is used to make trading decisions and identify unusual patterns in large swaths of data. In healthcare, disease diagnosis, epidemiology applications, and advancements in performant medical devices all use machine learning. In energy production and distribution, machine learning forms the basis of “smart grids,” in which telemetric instrumentation provides machine learning algorithms with real-time data, or “sentient grids,” in which algorithms are enabled to make decisions about power distribution (e.g. optimizing for demand surges or cost). In media and entertainment, machine learning models are used to power content recommendation engines, suggesting movies, TV shows, and other content to viewers, based on their prior choices.

These examples of sector-specific machine learning use cases already affect our daily lives, but they are just the beginning. The list of applications and the scope of their influence will continue to grow.


External research

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45, 5-32.
  • Cortes, C., & Vapnik, V. (1995). "Support-vector networks." Machine Learning, 20, 273-297.
  • Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the National Academy of Sciences, 79(8), 2554-2558.
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533-536.
  • Samuel, A. (1959). “Some Studies in Machine Learning Using the Game of Checkers.” IBM Journal of Research and Development, 3(3), 210-229.  
  • Tegmark, M. (2017). Life 3.0: Being Human in the Age of Artificial Intelligence.
  • Turing, A. M. (1950). "Computing Machinery and Intelligence." Mind, 59(236), 433-460.
  • Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems, 30.

 

Contributors

Miriam Fernández, CFA
Associate Director,
S&P Global Ratings,
miriam.fernandez@spglobal.com