By Pratap Ranade — Jul 19, 2022

Building a simulated economy to train stronger AI

Emergent selling strategies using Deep Reinforcement Learning

Arena was founded to bring breakthroughs in AI research out of the lab and into the real world.

One of the most important breakthroughs in AI research is deep reinforcement learning (RL), which underpins superhuman play in complex games such as chess and Go. At Arena, we use deep RL to optimize and automate business processes.

Today, most AI relies on supervised learning. These algorithms make predictions from existing data – e.g., predicting spam or insurance fraud.

In contrast, active learning is a branch of AI that makes decisions, not predictions. These algorithms are curious: they experiment in the real world, observe outcomes, and learn which actions lead to more desirable outcomes.

Reinforcement learning is a natural successor to active learning. It is a way of learning which action to take today to optimize outcomes in a potentially distant future.

The next leap forward in applied AI lies in active learning and deep reinforcement learning. Businesses that integrate these tools into their decision-making will be better equipped to compete in dynamic and complex markets with intelligent adversaries.

In this post, we describe how deep reinforcement learning addresses the limitations of traditional active learning approaches and how businesses can use deep reinforcement learning to find valuable, emergent strategies and implement them automatically.

The limits of active learning

Traditional active learning algorithms, such as bandits, work by trying an action in the real world (e.g., setting a price), observing the outcome (e.g., purchases), and updating a policy to favor actions that lead to favorable outcomes. By making decisions intelligently, active learning represents a major leap forward for the enterprise.
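
To make the try, observe, update loop concrete, here is a minimal sketch of an epsilon-greedy bandit choosing among three prices. It is an illustration only: the prices, the purchase probabilities, and the function name `run_price_bandit` are assumptions made up for this example, and epsilon-greedy is just one simple bandit rule among many.

```python
import random

def run_price_bandit(prices, purchase_prob, rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: try a price, observe revenue, update the estimate."""
    rng = random.Random(seed)
    counts = [0] * len(prices)          # how often each price was tried
    avg_revenue = [0.0] * len(prices)   # running mean revenue per price
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(prices))  # explore: pick a random price
        else:
            arm = max(range(len(prices)), key=lambda i: avg_revenue[i])  # exploit
        # Observe the outcome: a purchase at this price, or nothing.
        revenue = prices[arm] if rng.random() < purchase_prob[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the mean revenue for the chosen price.
        avg_revenue[arm] += (revenue - avg_revenue[arm]) / counts[arm]
    return counts, avg_revenue

# Hypothetical market: these purchase probabilities are hidden from the learner.
prices = [5.0, 10.0, 15.0]
purchase_prob = [0.8, 0.7, 0.2]
counts, avg_revenue = run_price_bandit(prices, purchase_prob)
favored_price = prices[counts.index(max(counts))]
print(favored_price, counts)
```

Every round spent on a weaker price is revenue forgone on a real customer, and the learner only ever sees the immediate purchase. Both points matter for the limitations discussed next.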

Nonetheless, active learning is limited in two ways. First, an active learner can only learn efficiently from a small set of potential actions (e.g., discounts). The larger the set of potential actions, the longer it will take to identify the best one. In the meantime, the business bears the cost of the algorithm’s experimentation.

A second limitation is that outcomes must follow soon after actions; otherwise, the active learner will not know whether its actions were good or bad. Often, immediate outcomes are poor measures of what the enterprise values in the long term. Consider discounts. If active learners are trained to maximize revenue or profit in the short term, they may offer steep and frequent discounts to spur purchases. However, these discounts may cannibalize future sales by inducing customers to stock up. They might also train customers to expect discounts in the future, thereby eroding pricing power and devaluing brands.

Reinforcement learning is not bound by these limitations

Reinforcement learning agents choose actions to optimize outcomes in a possibly distant future, such as discounted profit over a long horizon. This horizon can be determined by the organization’s objective rather than by the learner’s requirements. In this way, reinforcement learning optimizes business processes to achieve long-term goals.
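
A small worked example shows why the horizon matters. The sketch below computes discounted profit for two made-up weekly profit streams; the numbers, the discount factor `gamma`, and the 12-week horizon are all assumptions chosen for illustration.

```python
def discounted_return(profits, gamma=0.99):
    """Sum of gamma**t * profit_t over the horizon (t starts at 0)."""
    total = 0.0
    for t, profit in enumerate(profits):
        total += (gamma ** t) * profit
    return total

# Hypothetical weekly profits over a 12-week horizon.
steep_discount = [150.0] + [60.0] * 11   # strong first week, then cannibalized sales
steady_pricing = [100.0] * 12            # no promotion, stable demand

# A learner that only sees the first week prefers the steep discount.
short_term_winner = "steep" if steep_discount[0] > steady_pricing[0] else "steady"

# A learner that optimizes discounted profit over the full horizon does not.
long_term_winner = (
    "steep"
    if discounted_return(steep_discount) > discounted_return(steady_pricing)
    else "steady"
)
print(short_term_winner, long_term_winner)
```

Under these assumed numbers the two objectives disagree: the first week favors the steep discount, while the 12-week discounted profit favors steady pricing.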

The tradeoff in using reinforcement learning rather than traditional active learning algorithms is that reinforcement learning agents cannot practically be trained in the real world alone. If the agent cares about the consequences of its actions over the following year, a year will pass before the agent learns whether its actions today were good or bad.

Our innovation is to train RL agents in simulators that mimic supply and demand dynamics in a real-world market. In these simulators, reinforcement learning agents live out many lifetimes, observing the distant consequences of their actions and determining which actions optimize outcomes over a long horizon.

Training RL agents offline (i.e., in simulators) rather than online (i.e., in the real world) dramatically reduces the cost of exploration. Rather than experiment with a pricing strategy for a year in the real world, the agent tries out millions of pricing strategies in minutes, learning which is best without ever trying the worst on real customers.
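
The sketch below illustrates why offline evaluation is cheap. It uses a deliberately tiny, deterministic toy simulator in which a discount lifts demand this week but pulls half of that lift forward from the following week. Everything here is an assumption for illustration: the demand rule, the constants, and the function `simulate` are hypothetical, and exhaustive search over 729 plans stands in for RL training, which would be needed once the space is too large to enumerate.

```python
from itertools import product

DISCOUNTS = (0.0, 0.2, 0.4)   # hypothetical discount levels
FULL_PRICE, UNIT_COST, BASE_DEMAND = 10.0, 4.0, 100.0
WEEKS = 6

def simulate(plan):
    """Toy market: discounts lift demand now but borrow from next week."""
    profit, pulled_forward = 0.0, 0.0
    for discount in plan:
        lift = 4.0 * BASE_DEMAND * discount             # extra units sold this week
        units = max(BASE_DEMAND - pulled_forward, 0.0) + lift
        profit += units * (FULL_PRICE * (1.0 - discount) - UNIT_COST)
        pulled_forward = 0.5 * lift                     # customers stocked up
    return profit

# Evaluate every possible 6-week discount plan inside the simulator.
all_plans = list(product(DISCOUNTS, repeat=WEEKS))      # 3**6 = 729 plans
best_plan = max(all_plans, key=simulate)
always_discount = simulate((0.2,) * WEEKS)

print(best_plan, simulate(best_plan), always_discount)
```

In this toy, the search holds the full price for five weeks and discounts only in the final week, where there is no simulated future left to cannibalize. That last-week discount is an artifact of the finite horizon, and it is a reminder that an agent optimizes exactly what the simulator rewards.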

How do you create the simulator?

The value of the agent depends on the realism of its simulated training environment. We use imitation learning to mimic customer behavior as observed in historical data, and we train agents to set prices, provide recommendations, or choose other actions in a manner that optimizes simulated outcomes. This demand-side model can be combined with a supply-side model to account for supply considerations, such as stockouts, when choosing actions.

Our approach to modeling customer behavior derives from recent and stunning advances in natural language processing, in which machines, when seeded with an opening sentence, write the rest of the story. The way that a customer interacts with a business is a story too. That story may begin with the customer opening an app. She then navigates through the interface, guided by what she is looking for and by what the platform recommends. At various points, she adds items to her cart, and eventually, she checks out. In the same way that neural networks simulate the middle and the end of the story, our neural networks simulate each customer’s journey through a marketplace. Online marketplaces make this process seamless, as they record the clickstream data we use to tokenize events and train the model.
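
The idea of continuing a customer's story from its opening events can be sketched with a much simpler model than a neural network. The example below fits a first-order (bigram) transition model to a handful of tokenized sessions and samples a continuation. The event names, the sessions, and both function names are hypothetical, and the bigram counts are a stand-in for the neural sequence model described above, not a description of it.

```python
import random
from collections import Counter, defaultdict

# Hypothetical tokenized sessions; real clickstream vocabularies are far richer.
sessions = [
    ["open_app", "view_item", "add_to_cart", "checkout"],
    ["open_app", "search", "view_item", "add_to_cart", "checkout"],
    ["open_app", "view_item", "view_item", "exit"],
    ["open_app", "search", "exit"],
]

def fit_bigram(sessions):
    """Count how often each event token follows another."""
    transitions = defaultdict(Counter)
    for events in sessions:
        for current, following in zip(events, events[1:]):
            transitions[current][following] += 1
    return transitions

def simulate_journey(transitions, seed_events, rng, max_len=20):
    """Continue a journey from its opening events until a terminal token."""
    journey = list(seed_events)
    # Tokens with no observed successor ("checkout", "exit") end the journey.
    while len(journey) < max_len and journey[-1] in transitions:
        options = transitions[journey[-1]]
        tokens, weights = zip(*options.items())
        journey.append(rng.choices(tokens, weights=weights, k=1)[0])
    return journey

transitions = fit_bigram(sessions)
journey = simulate_journey(transitions, ["open_app"], random.Random(0))
print(journey)
```

A model like this only conditions on the previous event. A neural sequence model can condition on the whole journey so far, and on agent-controlled tokens such as a discount or a recommendation.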

This design allows the agent to control many aspects of a customer’s online journey, from the discount she is offered to the recommendations she is shown to the notifications she receives on her phone or email. The agent tries out a wide variety of policies for choosing discounts, recommendations, or notifications – or all of these simultaneously – and it prioritizes those that craft the customer experience in a manner that advances the enterprise’s long-term goals.

How do you deal with drift?

Drift is a major challenge for machine learning approaches that use historical data. With time, customer preferences change, new products are introduced, and competitors emerge and disappear. As a result, policies that are optimal today may be suboptimal tomorrow.

We address drift by continually retraining the simulator with new data generated by the agent working in the real world, and by retraining the agent in the latest version of the simulator. This process happens in the background, so that emerging trends are quickly captured by the simulator and exploited by the agent.

Incidentally, this approach also addresses another limitation of offline training (i.e., training in a simulator). We might be interested in policies that have never been tried, such as offering a larger discount than has been offered in the past. Unfortunately, the simulator cannot say how a customer might respond to a discount that no customer has seen.

As in active learning, our reinforcement learning agents occasionally try out promising new avenues in the real world. Unlike in active learning, however, the reinforcement learning agent does not learn from this experimentation directly. Rather, the data generated from this experimentation are fed back into the simulator. We then retrain the agent in the latest version of the simulator, but with fewer constraints on its exploration.
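
The retraining loop described in this section can be sketched in a few lines. In this toy, real-world demand is a straight line whose intercept drifts upward each week. The agent's price is shown to most customers and a slightly higher exploratory price to a small slice. Both observations feed the simulator, and the agent is then retrained against the refreshed simulator. The demand rule, the constants, and all function names are assumptions for illustration. The "simulator" is a two-point linear fit and "training" is a search over nine candidate prices, which are stand-ins for learned models.

```python
CANDIDATE_PRICES = [float(p) for p in range(8, 17)]   # hypothetical price grid
UNIT_COST = 4.0

def real_world_demand(price, week):
    """Hidden ground truth (assumed): demand drifts upward by 2 units per week."""
    return max(0.0, (100.0 + 2.0 * week) - 5.0 * price)

def fit_simulator(observations):
    """Least-squares line through (price, units) pairs; returns a demand function."""
    n = len(observations)
    mean_p = sum(p for p, _ in observations) / n
    mean_u = sum(u for _, u in observations) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in observations)
    cov = sum((p - mean_p) * (u - mean_u) for p, u in observations)
    slope = cov / var_p if var_p > 0 else 0.0   # guard: no price variation
    intercept = mean_u - slope * mean_p
    return lambda price: max(0.0, intercept + slope * price)

def train_agent(simulator):
    """'Training' here is picking the price with the best simulated profit."""
    return max(CANDIDATE_PRICES, key=lambda p: (p - UNIT_COST) * simulator(p))

price = 10.0
history, chosen = [], []
for week in range(1, 21):
    explore_price = price + 1.0                        # occasional real-world probe
    history.append((price, real_world_demand(price, week)))
    history.append((explore_price, real_world_demand(explore_price, week)))
    simulator = fit_simulator(history[-2:])            # refit on the newest data
    price = train_agent(simulator)                     # retrain in the new simulator
    chosen.append(price)

print(chosen)
```

The exploratory observations never update the agent directly. They only enter `history`, which the simulator is refit on, and the agent's price follows the drift because it is retrained against each refreshed simulator. Refitting on a single week of data is a simplification; a real system would pool far more observations.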

Deep RL can automate complex business processes

Reinforcement learning is an old idea, but it has recently been supercharged by deep learning – i.e., by representing the agent as a deep neural network. Deep reinforcement learning underlies recent feats of superhuman play in canonical games. At Arena, we are helping businesses play at superhuman levels.

Deep RL can reach superhuman performance on well-specified tasks. It can generate emergent insights – e.g., to discount deeply but briefly when introducing new products, to discount when inventory accumulates, or to prioritize availability to the most valuable customers when supply is constrained.

Deep RL can also help enterprises navigate daunting levels of complexity. Whereas it may be somewhat straightforward to price one product, many businesses sell multiple products. Some of these products may be substitutes; others may be purchased together. Deep RL implicitly learns how to navigate substitution patterns among many products. It also learns how to simultaneously account for seasonality, demand- and supply-side shocks, customer psychology, and competitor pricing. And it does all this in a manner that is personalized to the customer.

Welcome to the Arena.