
AI finally beats Ms. Pac-Man; a first for man and machine

Maluuba, a Microsoft subsidiary, used reinforcement learning to train its AI through trial and error over more than 800 million frames


By Heather Hamilton, contributing writer

Canadian company Maluuba, now a subsidiary of Microsoft, recently cracked Ms. Pac-Man. Its team of AI programmers is the first to beat the game in its 36-year history, reaching the maximum possible score of 999,990; in fact, no human or machine had ever achieved the feat before.

TechCrunch reports that, in 2015, Google’s DeepMind beat 50 Atari games, with Ms. Pac-Man a notable exception because of its many boards and moving parts. Maluuba achieved the formerly impossible task by dividing and conquering: it broke the game into smaller, more manageable tasks, each assigned to an individual AI agent.

“When we decomposed the game, there were over 150 agents working on different problems,” said Rahul Mehrotra, Maluuba program manager. Each pellet and fruit had a specific agent assigned to it, the ghosts had four, and edible ghosts warranted four more. All agents worked in parallel, feeding their reward signals to a top-level agent, which then decided how best to proceed. In a paper on their findings, the team refers to this system as Hybrid Reward Architecture, or HRA.
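To make the decomposition concrete, the roster of agents could be enumerated roughly as below. This is a hypothetical sketch inferred from the article, not Maluuba's actual code; the pellet grid dimensions are dummies chosen only to land above 150 agents.

```python
# Dummy 10x15 grid of pellet positions (150 of them), purely for illustration.
pellets = [(x, y) for x in range(10) for y in range(15)]

agents = (
    [("pellet", p) for p in pellets]            # one agent per pellet/fruit
    + [("ghost", g) for g in range(4)]          # one per deadly ghost
    + [("edible_ghost", g) for g in range(4)]   # one per edible ghost
)
print(len(agents))  # -> 158
```

With this toy layout the roster comes to 158 agents, consistent with the article's "over 150."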

According to Microsoft, “The top agent took into account how many agents advocated for going in a certain direction, but it also looked at the intensity with which they wanted to make that move. For example, if 100 agents wanted to go right because that was the best path to their pellet, but three wanted to go left because there was a deadly ghost to the right, it would give more weight to the ones who had noticed the ghost and go left.”
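The weighting scheme Microsoft describes can be sketched as a simple vote with intensities: each low-level agent reports a preference score per action, and the top agent sums them, so a few strongly negative "ghost" scores can outweigh many mildly positive "pellet" scores. The function and score values below are invented for illustration, not taken from Maluuba's system.

```python
ACTIONS = ["up", "down", "left", "right"]

def aggregate(agent_scores):
    """Sum per-action preference scores across agents; return the best action."""
    totals = {a: 0.0 for a in ACTIONS}
    for scores in agent_scores:              # one dict of action -> score per agent
        for action, score in scores.items():
            totals[action] += score
    return max(totals, key=totals.get)

# 100 pellet agents mildly prefer right; three ghost agents object intensely
# because a deadly ghost lies that way.
pellet_agents = [{"right": 1.0} for _ in range(100)]
ghost_agents = [{"right": -200.0, "left": 5.0} for _ in range(3)]
print(aggregate(pellet_agents + ghost_agents))  # -> left
```

The intensity of the ghost agents' objection, not their headcount, is what swings the decision left.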

The AI system also used reinforcement learning, in which every action draws a positive or negative response, allowing agents to learn through trial and error. The system was trained on more than 800 million frames.
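Trial-and-error learning of this kind can be illustrated with tabular Q-learning, one of the simplest reinforcement-learning algorithms. This toy is not Maluuba's setup: a walker on states 0 through 4 earns +1 for reaching the right edge and -1 for stepping off the left edge, and learns from those responses which direction pays.

```python
import random

random.seed(0)

N_STATES = 5
ACTIONS = [-1, +1]                       # step left / step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(1000):                    # episodes of trial and error
    s = 2
    for _ in range(100):                 # step cap per episode
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)                 # explore
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])  # exploit
        s2 = s + a
        if s2 < 0:
            reward, done = -1.0, True    # negative response: fell off the left
        elif s2 == N_STATES - 1:
            reward, done = +1.0, True    # positive response: reached the goal
        else:
            reward, done = 0.0, False
        target = reward if done else reward + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        if done:
            break
        s = s2

# After training, the learned values favor stepping right from every interior state.
print(all(Q[(s, +1)] > Q[(s, -1)] for s in range(1, N_STATES - 1)))  # -> True
```

The same learn-from-reward loop, scaled up to 800 million game frames and deep networks, is what each HRA head runs.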

The system exploited domain knowledge in three ways. It could remove irrelevant features, meaning those that do not affect the reward, and identify terminal states, meaning states from which a reward is no longer possible. Finally, it could use pseudo-reward functions, which, according to the paper, work rather differently: “Instead of updating a head of HRA using a component of the environment reward, it can be updated using a pseudo-reward. In this scenario, a set of GVFs is trained in parallel using pseudo-rewards. Each head of HRA uses (an) appropriate GVF(s). This can often result in more efficient learning.”
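The three tricks can be sketched for a single hypothetical head that cares about one pellet; the function names and state layout below are invented for illustration and do not come from the paper.

```python
# 1. Irrelevant features removed: this head sees only Pac-Man's position
#    and its own pellet; ghosts, fruit, and other pellets are dropped.
def pellet_head_view(full_state, pellet_pos):
    """Project the full game state down to what this head needs."""
    return (full_state["pacman_pos"], pellet_pos)

# 2. Terminal state identified: once the pellet is eaten, no reward is possible.
# 3. Pseudo-reward: the head is paid for reaching the pellet's square,
#    independent of the game's on-screen score.
def pellet_head_step(view, pellet_eaten):
    pacman_pos, pellet_pos = view
    if pellet_eaten:
        return 0.0, True                 # terminal: reward no longer possible
    pseudo_reward = 1.0 if pacman_pos == pellet_pos else 0.0
    return pseudo_reward, pacman_pos == pellet_pos

state = {"pacman_pos": (3, 4), "ghost_pos": (9, 9), "fruit": None}
view = pellet_head_view(state, (3, 4))
print(pellet_head_step(view, pellet_eaten=False))  # -> (1.0, True)
```

Shrinking each head's view this way is what keeps 150-plus learners tractable.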

Mehrotra believes that beating Ms. Pac-Man is a lot like running a company: larger goals are achieved by dividing employees into smaller teams, which work together like cogs in a machine. Each team has its own goal, but all are working toward a common end.

CTO Kaheer Suleman believes that breaking things down into smaller, more manageable parts is the basis of human problem solving. “A company doing product development is a good example,” he said. “The goal of the whole organization is to develop a product, but individually, there are groups that have their own reward and goal for the process.”

In an article on BBC News, Prof. Nello Cristianini, a computer scientist from the University of Bristol, cautions, “It is exciting that so much progress is happening today in AI; however, we should remember that historically, AI has not always been able to replicate results in games when transferring methods to real-world problems. This should be kept in mind whether we talk about Jeopardy!, Chess, Go, or Ms. Pac-Man.” His caution suggests that existing AI tends to brute-force solutions by testing all possible actions, rather than synthesizing information and making creative decisions.

Mehrotra sees potential for adapting a similar system to retail, where AI could help human sales representatives prioritize customers to best impact revenue. And while the published paper doesn’t elaborate much on those real-world applications, it does say this: “We argue that many real-world tasks allow for reward decomposition. Even if the reward function can only be decomposed in two or three components, this can already help a lot, due to the exponential decrease of the problem size that decomposition might cause.”
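The "exponential decrease" claim can be checked with back-of-the-envelope arithmetic. Assuming, for illustration, a task whose state is described by 20 independent binary features that split cleanly into two halves:

```python
n = 20
monolithic = 2 ** n              # one learner sees all 2^20 states
decomposed = 2 * 2 ** (n // 2)   # two learners of 10 features each
print(monolithic, decomposed)    # -> 1048576 2048
```

A single learner faces over a million states, while two half-sized components face about two thousand between them, which is why even a two-way decomposition "can already help a lot."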

Sources: Tech Crunch, Cornell University Library, BBC News, Microsoft
Image Source: Wikimedia
