Approximating Gradients for
Differentiable Quality Diversity
in Reinforcement Learning

Bryon Tjanaka

Oral Qualification Exam, 14 April 2022

Committee

  • Stefanos Nikolaidis (Chair)
  • Satyandra K. Gupta
  • Sven Koenig
  • Haipeng Luo
  • Gaurav Sukhatme

Learning robust behaviors?

Quality diversity optimization can help.

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

Example: Performance 2,300; Front: 40%; Back: 50%

QD Objective

For every output $x$ of the measure function $\bm{m}$, find $\bm{\phi}$ such that $\bm{m}(\bm{\phi}) = x$, and $f(\bm{\phi})$ is maximized.

Background: MAP-Elites

  1. Evaluate initial random solutions.
  2. Select solution $\bm{\phi}$ from the archive.
  3. Mutate $\bm{\phi}$ with Gaussian noise to obtain $\bm{\phi}'$.
  4. Evaluate $\bm{\phi}'$ and insert into archive.
  5. Go to step 2.
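
A minimal Python sketch of this loop, assuming two hypothetical helpers: evaluate(phi) returns the objective $f$ and measures $\bm{m}$, and cell_index(m) maps measures to a cell of the archive.

```python
import random
import numpy as np

def insert_solution(archive, idx, f, phi):
    # Keep the new solution if its cell is empty or it improves the objective.
    if idx not in archive or f > archive[idx][0]:
        archive[idx] = (f, phi)

def map_elites(evaluate, cell_index, dim, iterations=100_000, sigma=0.1):
    archive = {}  # cell index -> (objective, solution)

    # 1. Evaluate initial random solutions.
    for _ in range(100):
        phi = np.random.randn(dim)
        f, m = evaluate(phi)
        insert_solution(archive, cell_index(m), f, phi)

    # 2-5. Select, mutate, evaluate, insert, and repeat.
    for _ in range(iterations):
        _, phi = random.choice(list(archive.values()))        # 2. select from archive
        phi_new = phi + sigma * np.random.randn(dim)          # 3. Gaussian mutation
        f, m = evaluate(phi_new)                              # 4. evaluate phi'
        insert_solution(archive, cell_index(m), f, phi_new)   #    and insert
    return archive
```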

Key Insight: Stepping stones

A. Cully et al. 2015, "Robots that can adapt like animals." Nature 2015.

J.-B. Mouret and J. Clune 2015, "Illuminating search spaces by mapping elites." https://arxiv.org/abs/1504.04909

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

CMA-MEGA

Key Insight: Search by following objective and measure gradients.

Fontaine and Nikolaidis 2021, "Differentiable Quality Diversity." NeurIPS 2021 Oral.

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

Reinforcement Learning

Policy

$$\pi_{\bm{\phi}}(a | s)$$

Expected Discounted Return

$$f(\bm{\phi}) = \mathbb{E}_{\xi\sim p_{\bm{\phi}}}\left[\sum_{t=0}^T\gamma^t r(s_t,a_t) \right]$$
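
A Monte Carlo sketch of this quantity: roll out the policy and average the discounted return over a few episodes. Here env (a Gym-style environment) and policy(phi, s), which samples $a \sim \pi_{\bm{\phi}}(a|s)$, are hypothetical placeholders.

```python
import numpy as np

def estimate_return(env, policy, phi, gamma=0.99, episodes=10):
    returns = []
    for _ in range(episodes):
        s = env.reset()
        done, total, discount = False, 0.0, 1.0
        while not done:
            a = policy(phi, s)                 # a ~ pi_phi(a | s)
            s, r, done, _ = env.step(a)
            total += discount * r              # accumulate gamma^t * r(s_t, a_t)
            discount *= gamma
        returns.append(total)
    return np.mean(returns)                    # estimate of f(phi)
```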

Policy Gradient Assisted MAP-Elites
(PGA-MAP-Elites)

O. Nilsson and A. Cully 2021. "Policy Gradient Assisted MAP-Elites." GECCO 2021.

Inspired by
PGA-MAP-Elites!

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

Hypothesis:

Since CMA-MEGA performs well in DQD domains,
it will outperform existing QD-RL algorithms
(e.g., PGA-MAP-Elites and MAP-Elites).

DQD: Exact Gradients, CMA-MEGA
QD-RL: CMA-MEGA?

Problem: Environments are non-differentiable!

Solution: Approximate $\bm{\nabla} f$ and $\bm{\nabla m}$.

DQD: Exact Gradients, CMA-MEGA
QD-RL: Approximate Gradients, CMA-MEGA with gradient approximations

Approximating $\bm{\nabla} f$


Expected discounted return

Off-Policy Actor-Critic Method (TD3)

S. Fujimoto et al. 2018, "Addressing Function Approximation Error in Actor-Critic Methods." ICML 2018.

Evolution Strategy (OpenAI-ES)

T. Salimans et al. 2017, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." https://arxiv.org/abs/1703.03864

Approximating $\bm{\nabla} \bm{m}$


Measures are a black box, so approximate $\bm{\nabla} \bm{m}$ with an evolution strategy (OpenAI-ES).

                      CMA-MEGA (ES)   CMA-MEGA (TD3, ES)
$\bm{\nabla} f$       ES              TD3
$\bm{\nabla} \bm{m}$  ES              ES
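
To make the table concrete, here is a sketch of how the two approximations could be assembled for CMA-MEGA (TD3, ES). td3_objective_grad (an objective gradient from a TD3 critic) and evaluate_measures (returns the k measure values of a policy) are hypothetical helpers, and sigma and lam are placeholder constants; in CMA-MEGA (ES), grad_f would instead come from the same ES estimator as grad_m.

```python
import numpy as np

def approximate_gradients(phi, td3_objective_grad, evaluate_measures,
                          sigma=0.02, lam=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng

    # Objective gradient from the TD3 critic (off-policy actor-critic).
    grad_f = td3_objective_grad(phi)

    # Measure gradients via OpenAI-ES style perturbations: weight each
    # perturbation by the measure values it produces (one row per measure).
    eps = rng.standard_normal((lam, len(phi)))
    m_vals = np.array([evaluate_measures(phi + sigma * e) for e in eps])  # (lam, k)
    grad_m = (m_vals.T @ eps) / (lam * sigma)                             # (k, n)

    return grad_f, grad_m
```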

CMA-MEGA (ES) & CMA-MEGA (TD3, ES)

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

QDGym

  • Objective: Walk forward
  • Measures: Foot contact time
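
In QDGym, each measure is (roughly) the fraction of the episode during which a given foot touches the ground. A hypothetical sketch of that computation, assuming a boolean contact log recorded during the rollout:

```python
import numpy as np

def foot_contact_measures(contacts):
    # contacts: boolean array of shape (T, num_feet); contacts[t, i] is True
    # when foot i touches the ground at timestep t.
    return np.asarray(contacts, dtype=float).mean(axis=0)
```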

QD Ant

QD Half-Cheetah

QD Hopper

QD Walker

Parameters

  • 1M evaluations
  • Archive of 1000 cells
  • 3-layer policy

Independent Variables

  • Algorithm:
    CMA-MEGA (ES), CMA-MEGA (TD3, ES),
    PGA-MAP-Elites, MAP-Elites, ME-ES
  • Environment:
    QD Ant, QD Half-Cheetah, QD Hopper, QD Walker

Dependent Variable

  • QD Score
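
QD score is commonly defined as the sum of the objective values of all elites in the archive. A minimal sketch, reusing the dict-style archive from the MAP-Elites sketch earlier; the offset is a hypothetical constant that keeps policies with negative returns from reducing the score.

```python
def qd_score(archive, offset=0.0):
    # Sum objective values over all occupied cells.
    return sum(f - offset for f, _ in archive.values())
```
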
  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

QD Ant (example behaviors): Best, 2 legs, 3 legs

QD Half-Cheetah (example behaviors): Best, Back foot, Front foot

QD Hopper (example behaviors): Best, High contact, Low contact

QD Walker (example behaviors): Best, Favoring one foot

                      CMA-MEGA (ES)        CMA-MEGA (TD3, ES)
vs. PGA-MAP-Elites    Comparable on 2/4    Comparable on 4/4
vs. MAP-Elites        Outperforms on 4/4   Outperforms on 4/4
vs. ME-ES             Outperforms on 3/4   Outperforms on 4/4

Inspired by
PGA-MAP-Elites!

CMA-MEGA

(DQD Benchmark Domain)

Easy objective, difficult measures

CMA-MEGA (ES)

(QD Half-Cheetah)

Difficult objective, easy measures

                           PGA-MAP-Elites   CMA-MEGA (ES), CMA-MEGA (TD3, ES)
Objective Gradient Steps   5,000,000        5,000

Future Directions

Additional Projects

pyribs

A bare-bones Python library for quality diversity optimization.

105 GitHub stars

B. Tjanaka et al. 2021, "pyribs: A bare-bones Python library for quality diversity optimization." https://github.com/icaros-usc/pyribs.

Lunar Lander Tutorial

https://lunar-lander.pyribs.org

On the Importance of Environments in
Human-Robot Collaboration

M. C. Fontaine*, Y.-C. Hsu*, Y. Zhang*, B. Tjanaka, S. Nikolaidis. "On the Importance of Environments in Human-Robot Collaboration." RSS 2021.

Potential Future Projects

  • Learning collaborative strategies.
  • QD-RL for real-world robots.
  • Enhance and publish pyribs.

Approximating Gradients for Differentiable Quality Diversity in
Reinforcement Learning

Bryon Tjanaka

Oral Qualification Exam, 14 April 2022

  1. Quality Diversity (QD)

  2. Differentiable Quality Diversity
    (DQD)

  3. Quality Diversity for Reinforcement Learning (QD-RL)

  4. Approximating Gradients
    for DQD in RL

  5. Experiments

  6. Results

Supplemental

DQD Benchmark

$$sphere(\bm{\phi}) = \sum_{i=1}^n (\bm{\phi}_i-2.048)^2$$

$$clip(\bm{\phi}_i) = \begin{cases} \bm{\phi}_i & \text{if } -5.12 \le \bm{\phi}_i \le 5.12 \\ 5.12/\bm{\phi}_i & \text{otherwise} \end{cases}$$

$$\bm{m}(\bm{\phi}) = \left(\sum_{i=1}^{\lfloor\frac{n}{2}\rfloor} clip(\bm{\phi}_i), \sum_{i=\lfloor\frac{n}{2}\rfloor+1}^n clip(\bm{\phi}_i) \right)$$

Fontaine et al. 2020, "Covariance Matrix Adaptation for the Rapid Illumination of Behavior Space." GECCO 2020.
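
For reference, a direct Python translation of the formulas above (a sketch; function names are illustrative, and phi is a length-n NumPy array):

```python
import numpy as np

def sphere(phi):
    return np.sum((phi - 2.048) ** 2)

def clip(phi_i):
    return phi_i if -5.12 <= phi_i <= 5.12 else 5.12 / phi_i

def measures(phi):
    half = len(phi) // 2
    return (sum(clip(x) for x in phi[:half]),
            sum(clip(x) for x in phi[half:]))
```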

DQD StyleGAN+CLIP

Fontaine and Nikolaidis 2021, "Differentiable Quality Diversity." NeurIPS 2021 Oral.

OpenAI-ES

$$\bm{\nabla} f(\bm{\phi}) \approx \frac{1}{\lambda_{es}\sigma} \sum_{i=1}^{\lambda_{es}} f(\bm{\phi} + \sigma \bm{\epsilon}_i) \bm{\epsilon}_i$$
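
A minimal NumPy sketch of this estimator, where f is the black-box function being differentiated (e.g. the expected return or a measure); sigma and lam ($\lambda_{es}$) are placeholder values, and practical implementations often add mirrored sampling and rank normalization.

```python
import numpy as np

def openai_es_gradient(f, phi, sigma=0.02, lam=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((lam, len(phi)))               # epsilon_i ~ N(0, I)
    returns = np.array([f(phi + sigma * e) for e in eps])    # f(phi + sigma * epsilon_i)
    return (returns @ eps) / (lam * sigma)
```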

CVT Archive

Vassiliades et al. 2018. "Using Centroidal Voronoi Tessellations to Scale Up the Multidimensional Archive of Phenotypic Elites Algorithm." IEEE Transactions on Evolutionary Computation 2018.

Sliding Boundaries Archive

Fontaine et al. 2019. "Mapping Hearthstone Deck Spaces through MAP-Elites with Sliding Boundaries." GECCO 2019.
