
 

The purpose of this project was to determine whether a traditional hard-coded robot controller could be replaced with a neural network evolved by a genetic algorithm. Although the neural network was extremely simple (containing only 6 hidden neurons), it produced many complex behaviors, including obstacle avoidance, waypoint seeking, strategy building, pursuit, cooperation, and evasion. Videos recorded from the simulation are shown on the pages that follow. The biological scenarios were especially interesting in that they showed the emergence of life-like behaviors resembling predators, prey, mutualism, and evasion. A more detailed description of the neural network controller is provided in the paper below.

 

The Autonomous Acquisition of

Robot Control through Neuroevolution

 

Introduction

 

Designing robots that autonomously respond to dynamic environments has been such a difficult challenge that it was largely ignored until relatively recently. In solving this problem, many researchers have abandoned the traditional techniques of rigid plans and hard-coded controls and have replaced them with sensor-dependent, highly reactive behaviors. This offers the robot a distinct advantage over planning-based strategies, which usually require perfect world models and perfect localization for safe locomotion. Instead, reactive robots only need the ability to sense and respond to their immediate surroundings. If the surroundings change, a robot can simply react to the new situation.

 

Typically, the design of reactive control is accomplished using a set of simple behaviors, where each behavior is very effective at accomplishing a very specific goal. Such behaviors include avoiding obstacles, wall following, seeking out landmarks, seeking out open spaces, seeking out other robots, etc. Depending on the design strategy, a robot may switch between its various behaviors, or it may use a weighting scheme to determine the extent to which each behavior determines its actions. Although this approach is more flexible than traditional methods, it still requires a considerable amount of hard coding and heuristics. Researchers must engage in the laborious process of separately coding each of the simple behaviors and then determine an appropriate way to fuse the behaviors under given environmental conditions. Each of these steps can be time-consuming, error-prone, and subject to researcher biases and impatience.

 

This project shows how a genetically evolved neural network can take the idea of reactive control a step further and eliminate the laborious process of coding and fusing behaviors. Instead of coding the robot to react to its environment, a genetic algorithm allows a naïve robot to be placed in an environment, where it reacts to the efficacy of its previous responses. A robot that finds itself doing well continues what it is doing, while a poorly performing robot tries to change. The only thing a researcher needs to do to train a new task is to assign a set of criteria that the robot can use as feedback. For example, to train a robot to follow another robot, the researcher only needs to specify a following distance that will result in a reward for the robot, as sketched below.
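Such a criterion could be expressed as a small reward function evaluated at every time-step. The sketch below is only an illustration of the idea, not the project's actual fitness code; the function name, the 2.0-meter target, and the 0.5-meter tolerance are assumed values.

def follow_reward(distance_to_leader, target_distance=2.0, tolerance=0.5):
    """Award a point whenever the robot stays near the desired following distance."""
    return 1.0 if abs(distance_to_leader - target_distance) <= tolerance else 0.0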

 

This project used simulated environments to demonstrate that robots can independently learn tasks such as safely moving about their environment, waypoint seeking, strategy building, pursuit, evasion, and cooperation. Robots demonstrated the ability to learn tasks from scratch, as well as the ability of trained robots to learn additional skills. To emphasize the similarity between robot evolution and biological evolution, additional simulations were created to show the emergence of life-like behaviors including predator-prey, mutualistic, and parasitic relationships.

 

 

Robot Simulation

 

Although the current implementation is entirely in simulation, the simulation has been designed to closely resemble a real environment, and it is believed that robots trained in the simulation should fare reasonably well in the real world. The map used for the simulation was generated from laser data acquired by driving a robot through the halls of the computer science building. The drivable area is about 90 x 50 meters and is represented at a scale of 1 pixel per 0.125 meters. This map is shown as Figure 1. The rough color texture is only used to aid the researcher in location identification and is not used by the robot.

 

 

Figure 1. Simulated robot environment.

 

 

Map Entities

 

Since the goal was to train robots to operate robustly in a dynamic setting, eight different robots were trained in the same environment at the same time. Each robot was considered to be an obstacle to every other robot, and when a collision occurred, both robots involved would lose points as a consequence. Graphic representations of the map's components are as follows.

 

Icons

Living robots can be one of two types, represented by red and green. In many of the simulations the two colors were required to compete against each other. Robots viewed their environment through a set of sensors (explained below) and responded to the environment by changing their direction and velocity.

 

Dead robots are represented as hollow circles and are incapable of movement, though these robots are still considered obstacles to other robots. A robot dies when it hits a wall or another robot.

 

Waypoints represent goals to the robot. If a robot gets sufficiently close to the center of the waypoint, that goal is considered attained. In this simulation, red waypoints are goals to green robots and green waypoints are goals to red robots.

 

If the goal represented by a waypoint is satisfied, the waypoint will deactivate and turn dark.

 

Some waypoints represent complex goals, which require the visitation of multiple robots. An ‘X’ represents a waypoint that has only partially been satisfied.

 

 

Robot Sensor Inputs (26 total)

 

Robots are only aware of locally accessible information. They do not have any global knowledge of the map and do not know the locations of waypoints or other robots unless those objects are sufficiently close and are in the robot's line-of-sight. If an object is just around a corner, the robot will be unaware of its presence. The robot contains sensors to detect its own speed, the location of obstacles, the location of waypoints, and the location of other robots. Sensor specifics are shown below, followed by a sketch of how the readings might be assembled into the controller's input vector.

 

  • 1 Speed Sensor: The robot receives an accurate measure of its own speed. Due to the imposition of momentum, this speed will not necessarily equal the speed commands being sent by the robot.

 

  • 7 Laser Sensors: For computational efficiency, the laser used in simulation provides less information than a real-world laser. In the real world, the laser takes up to 361 measurements over 180 degrees of coverage, providing an angular resolution of 0.5 degrees.

    • Activated by walls or other robots

    • Provides: Straight-line distance measurement in meters

    • Total Coverage: 180 degrees in front of robot

    • Resolution: 30 degrees

    • Range: 8 meters.

 

  • 4 Team Robot Sensors: This low-resolution device will not provide an accurate angle to an object, but will provide an accurate distance. The angular resolution is only good enough to determine if the object is in front, behind, left, or right.

    • Activated by the presence of same-colored robots

    • Provides: Straight-line distance measurement in meters

    • Total Coverage: Overlapping coverage of 360 degrees around robot

    • Resolution: 120 degrees

    • Range: 24 meters.

 

  • 4 Opponent Robot Sensors:

    • Activated by the presence of opposite-colored robots

    • Specifications match the sensor above

 

  • 4 Team Waypoint Sensors:

    • Activated by the presence of same-colored waypoints

    • Specifications match the sensor above

 

  • 4 Opponent Waypoint Sensors:

    • Activated by the presence of opposite-colored waypoints

    • Specifications match the sensor above

 

  • 2 Internal Memory Sensors: The conventional reactive architecture does not contain any internal state; robots only respond to their immediate surroundings. Though this can produce effective behaviors, it also limits what a robot can accomplish. In an attempt to offer robots some ability to store current state information, these memory sensors were added. These sensors simply report the values of two output neurons, producing a limited amount of recurrence in the neural network.

    • Always active

    • Provides: A capped approximation of two output values

    • Range: Values are capped to a range of -10.0 to 10.0
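
As a rough illustration of how these 26 readings could be packed into the controller's input vector, the following sketch concatenates the sensor groups described above. The parameter names and ordering are assumptions, not the project's actual code.

def build_sensor_vector(speed, laser, team_robots, opponent_robots,
                        team_waypoints, opponent_waypoints, memory):
    """Pack the 26 sensor readings into one input vector (ordering is assumed)."""
    assert len(laser) == 7               # 180-degree coverage at 30-degree resolution
    assert len(team_robots) == 4         # front / behind / left / right distances
    assert len(opponent_robots) == 4
    assert len(team_waypoints) == 4
    assert len(opponent_waypoints) == 4
    assert len(memory) == 2              # fed back from the two memory outputs
    return ([speed] + list(laser) + list(team_robots) + list(opponent_robots)
            + list(team_waypoints) + list(opponent_waypoints) + list(memory))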

 

 

Robot Output Commands (10 total)

 

For every iteration of the simulation, robots receive information from their available sensors, process the data, and produce outputs that control their steering and throttle. The robots are only able to issue a limited number of commands to their motors. These are as follows.

 

  • 5 Steering Angles:

    • -15° (right)

    • -5° (right)

    • 0° (straight)

    • 5° (left)

    • 15° (left)

 

  • 3 Speed Settings:

    • 0.02 meters/sec

    • 0.1 meters/sec

    • 0.5 meters/sec

 

  • 2 Internal Memory Outputs: Recursively fed to the 2 internal memory sensors

 

In an attempt to increase the simulation's realism, commands to the motors do not generate instantaneous changes. Instead, the motors respond by the amount that is half-way between the robot's current state and the requested state. For instance, if the robot is traveling at 0.1 meters/sec and it receives a command to go 0.5 meters/sec, the resulting speed will be 0.3 meters/sec. After the command has been issued to the motors, the robot's position is updated on the map. If the movement places the robot onto a wall or another obstacle, that robot is considered wrecked and is disabled for the remainder of the trial.
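
The sketch below illustrates how the ten outputs listed above might be decoded and how the half-way motor update could be applied. The argmax decoding, the output ordering, and the treatment of steering are assumptions; the speed update follows the example in the text.

STEERING_ANGLES = [-15.0, -5.0, 0.0, 5.0, 15.0]   # degrees (negative = right)
SPEEDS = [0.02, 0.1, 0.5]                          # meters/sec

def apply_outputs(outputs, current_speed, current_heading):
    """Decode the 10 output values and apply the half-way motor response."""
    steer_cmd = STEERING_ANGLES[max(range(5), key=lambda i: outputs[i])]
    speed_cmd = SPEEDS[max(range(3), key=lambda i: outputs[5 + i])]
    memory = [max(-10.0, min(10.0, v)) for v in outputs[8:10]]   # recycled next time-step

    # Motors move half-way from the current state toward the requested state,
    # e.g. 0.1 m/s commanded to 0.5 m/s yields 0.3 m/s.
    new_speed = current_speed + 0.5 * (speed_cmd - current_speed)
    new_heading = current_heading + 0.5 * steer_cmd               # assumed: same rule for steering
    return new_speed, new_heading, memory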

 

 

Robot Learning Algorithm

 

Robot learning was accomplished using the "Neuroevolution with Enforced Sub-Populations" (ESP) algorithm (Gomez and Miikkulainen 1999; Bryant and Miikkulainen 2003). This project involved a simple fully-connected feed-forward neural network containing an input layer (the robot's sensors), an output layer (the robot's responses), and a single hidden layer of neurons. Each neuron in the hidden layer received a copy of the output from every sensor in the input layer, multiplied by a connection weight (the input weight). The hidden neuron combined these values through simple summation and produced an output value that was similarly multiplied by a connection weight (the output weight) before propagating to the output-layer response. During training, a genetic algorithm was used to manipulate the input and output weights so that the robot's responses became increasingly appropriate for a given set of inputs. More specifically, the input and output weights of each hidden-layer neuron were represented as a one-dimensional chromosome. Weights were manipulated by allowing those chromosomes to undergo random changes (mutation) and information exchange with other individuals in the population (recombination). Chromosome fitness was determined by comparing the success of individuals who possessed that chromosome to individuals who did not. Chromosomes with greater fitness were offered more opportunities to recombine during subsequent generations.

 

The novel feature of the ESP algorithm is that the neural network is not evolved as a complete unit. Instead, the network is divided into individual hidden neurons, which are evolved independently within separate populations. During testing and training of a network with n hidden-layer neurons, neurons from populations 1 through n are placed in the respective positions 1 through n of the hidden layer. The network is then tested by allowing the simulated robot to interact with its environment for a predetermined amount of time. The fitness of a neuron is determined by averaging the success of all networks to which it contributed. For a neuron to be successful, it must act as a good teammate and be able to increase a robot's chances of survival regardless of the neurons it is partnered with. The following figure illustrates the separation of hidden-layer neuron populations, and a conceptual sketch of the process follows the figure.

 

 

Neuroevolution with ESP

Simulated robots are controlled using a neural network (represented in the left-most image) that is trained using neuroevolution with enforced sub-populations (ESP). The connection weights (represented by gray lines) within each hidden-layer neuron (1, 2, 3, ..., n) are evolved within separate populations of similar cells (represented in the remaining images). Networks are constructed by randomly selecting one individual from each population and placing it in the corresponding position in the hidden layer.
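
The following is a conceptual sketch of one ESP generation as described above; the function names and bookkeeping are assumptions rather than the project's implementation.

import random

def esp_generation(subpopulations, evaluate_network, n_trials):
    """One ESP generation: assemble networks from sub-populations and credit fitness back."""
    scores = {id(c): [] for pop in subpopulations for c in pop}
    for _ in range(n_trials):
        team = [random.choice(pop) for pop in subpopulations]   # one neuron per hidden position
        fitness = evaluate_network(team)                        # one simulation trial
        for chromosome in team:                                 # every participant shares the score
            scores[id(chromosome)].append(fitness)
    # A neuron's fitness is the average over all networks it served in.
    return {cid: sum(v) / len(v) for cid, v in scores.items() if v}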

 

 

Application of ESP to robot control

 

Each robot controller consisted of a fully-connected three-layer neural network similar to the one described above. The network contained 26 input sensors, 10 output commands, and 6 hidden neurons that connected the two. Additionally, the network contained a bias connection in the input and hidden layers that propagated a value of one (multiplied by the corresponding weights) to the downstream neurons.
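
A minimal sketch of a forward pass through this controller is shown below. It assumes the simple summation described earlier (no squashing function is mentioned in the text), a 37-weight layout of 26 input weights, 1 input-bias weight, and 10 output weights per hidden neuron, and a hidden-bias chromosome carrying 10 output weights of its own.

import numpy as np

def forward(sensors, hidden_chromosomes, hidden_bias_weights):
    """Propagate 26 sensor values through 6 hidden neurons to 10 output values."""
    sensors = np.asarray(sensors, dtype=float)                       # 26 sensor readings
    outputs = np.asarray(hidden_bias_weights, dtype=float).copy()    # hidden-bias contribution (10,)
    for chrom in hidden_chromosomes:                                 # 6 chromosomes of 37 weights each
        w_in, w_bias, w_out = chrom[:26], chrom[26], chrom[27:37]
        activation = float(np.dot(sensors, w_in)) + w_bias * 1.0     # input bias propagates a value of one
        outputs += activation * np.asarray(w_out, dtype=float)
    return outputs                                                   # 10 command values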

 

There were 37 weights associated with each hidden neuron, including weights for the 26 inputs, 1 input bias, and 10 outputs. These weights were stored as a 37-allele, one-dimensional chromosome. New robots were given weights that were initialized to random values between -1 and +1, such that there was a high likelihood of values close to zero and an exponentially lower likelihood of values close to ±1. Mutation could change the weights from their original values, but a restriction was applied that capped the values to the range -1.0 to +1.0. Each of the hidden-neuron chromosomes was maintained in a separate sub-population. Six sub-populations contained chromosomes from the 6 hidden neurons, and a seventh sub-population contained the chromosome associated with the hidden bias. For the current implementation, each population contained 51 individuals. This number was selected somewhat arbitrarily to produce sufficiently long learning-curve graphs in a tractable amount of time. Bryant and Miikkulainen (2003) suggested that better results could be obtained using much larger populations (500+ individuals), but this has not yet been investigated for the current project.
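
The sketch below illustrates the kind of initialization and capping described here. The particular distribution (an exponentially distributed magnitude with a random sign), its scale, and the mutation step size are assumptions.

import random

GENES_PER_CHROMOSOME = 37   # 26 input weights + 1 input-bias weight + 10 output weights

def random_weight(scale=0.2):
    """Weight concentrated near zero, exponentially less likely near +/-1, capped to [-1, 1]."""
    w = random.expovariate(1.0 / scale) * random.choice((-1.0, 1.0))
    return max(-1.0, min(1.0, w))

def new_chromosome():
    return [random_weight() for _ in range(GENES_PER_CHROMOSOME)]

def mutate_gene(value, step=0.1):
    """Mutation may shift a weight, but the result stays within [-1, 1]."""
    return max(-1.0, min(1.0, value + random.uniform(-step, step)))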

 

Evaluation

 

During each learning cycle, a neural network was assembled by randomly selecting a neuron chromosome from each of the 7 populations. The network was evaluated by using it to drive the robot in the simulation for a trial of 7,500 time-steps. In each time-step, sensors sent local environment information to the hidden-layer neurons, which propagated the signal to the outputs that directed the robot's motors. Depending on the command, the motors would maintain speed, or would accelerate or decelerate by an amount allowed by the physics of the simulation. The robot would then travel a short distance before the beginning of the next time-step. During the movement, points could be added to or deducted from the robot's performance score according to a predefined fitness function. At the completion of the trial, the accumulated score was given to each chromosome that participated in the run. This process was repeated for another randomly selected set of chromosomes until every chromosome had participated in 10 simulation trials. A chromosome's final fitness was the average of all 10 evaluation opportunities. Since each hidden-layer cell was taken from a population of 51 individuals, 510 trials were needed to produce a fitness score for the entire population of hidden-neuron chromosomes. It should be mentioned that, while one robot's chromosomes were being tested, seven other robots with their own independent sets of chromosomes were also being tested at the same time, in the same environment. This helped produce a dynamic training environment and resulted in a diverse set of solutions to the training task.
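
One way to guarantee that every chromosome appears in exactly 10 trials is to build a shuffled schedule ahead of time; the sketch below assumes this approach, and the project may have scheduled its trials differently.

import random

def evaluation_schedule(subpopulations, trials_per_chromosome=10):
    """Build 510 trial line-ups so each of the 51 chromosomes per population appears 10 times."""
    pop_size = len(subpopulations[0])                      # 51
    n_trials = pop_size * trials_per_chromosome            # 510
    columns = []
    for pop in subpopulations:                             # one column per hidden position
        slots = list(pop) * trials_per_chromosome
        random.shuffle(slots)
        columns.append(slots)
    # Each row of the schedule is one network: one chromosome per hidden position.
    return [[col[t] for col in columns] for t in range(n_trials)]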

 

Mating

 

After each round of evaluation trials was completed, neuron chromosomes were mated with others in the same population using an elitist breeding strategy. The one individual with the highest fitness continued to the next generation unchanged. 24 breeding pairs were then selected to produce 24 pairs of offspring for the next generation. Selection was determined using a normalized roulette-wheel strategy. The roulette wheel provided high-fitness individuals with a high probability of participating in a breeding pair, while low-fitness individuals were given a low probability. During breeding, there was a 75% probability that a single-point crossover would occur. Random mutations occurred with 1% probability. A sketch of one breeding round is given below.
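
The sketch follows the parameters above for a single sub-population. It assumes fitness values have already been normalized to be non-negative for the roulette wheel, applies the 1% mutation per gene (one possible reading of the text), and uses the hypothetical mutate_gene helper from the earlier sketch.

import random

def breed(population, fitness, mutate_gene):
    """Produce 49 individuals: 1 elite plus 48 offspring (two 'outsiders' are added separately)."""
    ranked = sorted(population, key=fitness, reverse=True)
    next_gen = [ranked[0]]                                  # elitism: best individual unchanged

    total = sum(fitness(c) for c in population)
    def roulette():                                         # fitness-proportional selection
        pick, acc = random.uniform(0, total), 0.0
        for c in population:
            acc += fitness(c)
            if acc >= pick:
                return c
        return population[-1]

    for _ in range(24):                                     # 24 pairs -> 48 offspring
        a, b = list(roulette()), list(roulette())
        if random.random() < 0.75:                          # 75% single-point crossover
            point = random.randrange(1, len(a))
            a, b = a[:point] + b[point:], b[:point] + a[point:]
        for child in (a, b):
            next_gen.append([mutate_gene(g) if random.random() < 0.01 else g
                             for g in child])               # 1% per-gene mutation
    return next_gen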

 

To further introduce genetic variability, two “outsider” individuals were added to the resulting population. Depending on the training phase, these individuals were either modified versions of the most fit individual (modified in such a way that every allele was changed by a small amount from the original), or they were extracted from the population of another robot of the same type. When “outsider” individuals were taken from within a robot's own populations, the eight different robots in the simulation evolved independently of one another. When “outsider” individuals were taken from another robot of the same type, genetic information could be exchanged between robots. The different strategies were applied to different learning scenarios. During initial training of obstacle-avoidance behaviors, robots were not allowed to share genetic information, with the hope that they would each develop their own strategies and personalities. For all subsequent scenarios, robots on the same team (represented by the same color) were allowed to share their best population members with others on the same team. It was believed that this would help reintroduce beneficial genetic variability into populations that had already converged during the initial obstacle-avoidance training.
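
The two "outsider" strategies could be sketched as follows; the function name, the random sampling of a teammate's individuals, and the perturbation size are assumptions.

import random

def make_outsiders(best, teammate_population, share_with_team, mutate_gene):
    """Return two 'outsider' chromosomes for the next generation."""
    if share_with_team:
        # Import two individuals from a same-colored robot's corresponding population.
        return [list(c) for c in random.sample(teammate_population, 2)]
    # Otherwise perturb the fittest individual so that every allele shifts slightly.
    return [[mutate_gene(g, step=0.05) for g in best] for _ in range(2)]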

 

 

Experimentation

 

Robots were placed in six different training scenarios to demonstrate their ability to autonomously learn a variety of tasks. These scenarios were Obstacle-Avoidance, Individual Waypoint-Seeking, Team Waypoint-Seeking, Predator-Prey Interaction, Mutualistic Interaction, and Parasitic Interaction.

 

Eight robots (each controlled by 7 separate hidden-neuron populations of 51 members) were trained in each scenario for between 100 and 200 generations. The robots placed in any given scenario were physically identical to those placed in any other scenario. All robots possessed the same set of sensors and produced the same output control, with one exception: a few of the scenarios required a diverse group of individuals. This was accomplished by regulating the top speed of one group of individuals to be slightly slower than the other robots. Going into such a scenario, the modified individuals had been trained only on their unmodified equipment and had to adapt to the modification on the fly.

 

The specifics for each scenario are provided on the following pages, along with a quantitative comparison of each robot's fitness score during the trial, and a qualitative discussion of the observed emergence of behaviors at different stages in training.

 

 

References:

 

Gomez, F., and Miikkulainen, R. (1999). Solving non-Markovian control tasks with neuroevolution. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden. San Francisco, CA: Morgan Kaufmann.

 

Bryant, B. D., and Miikkulainen, R. (2003). Neuroevolution for adaptive teams. In Proceedings of the 2003 Congress on Evolutionary Computation (CEC 2003), Vol. 3, pp. 2194-2201. Piscataway, NJ: IEEE Press.
