2024 | Original Paper | Book Chapter

Bag of Policies for Distributional Deep Exploration

Authors: Asen Nachkov, Luchen Li, Giulia Luise, Filippo Valdettaro, A. Aldo Faisal

Published in: Epistemic Uncertainty in Artificial Intelligence

Publisher: Springer Nature Switzerland


Abstract

Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). In contrast to previous Thompson-sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop a general-purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads, and the collected state-action pairs are used to update all heads off-policy, leading to distinct learning signals for each head that diversify learning and behaviour. To test whether an optimistic ensemble method can improve distributional RL as it does scalar RL, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution over return distributions along with a posterior distribution over policies. Our setup allows us to analyze global posterior uncertainty and a local curiosity bonus simultaneously for exploration. As BDPG is already an optimistic method, this pairing helps us investigate the extent to which accumulating curiosity bonuses is beneficial. Overall, BoP yields greater robustness and faster learning, as demonstrated by our experimental results on ALE Atari games.
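The training scheme the abstract describes can be summarized in a short sketch: one head is sampled to control each episode, and the resulting trajectory is then used to update every head off-policy. The Python below is a minimal illustration of that control flow only, assuming hypothetical PolicyHead, env.reset, and env.step interfaces; it is not the authors' implementation.

```python
# Minimal sketch of one BoP training episode, as described in the abstract.
# PolicyHead, env.reset(), env.step(), act() and update_off_policy() are
# illustrative assumptions, not the paper's actual code.
import random

class PolicyHead:
    """One ensemble member, e.g. a BDPG-style distributional actor-critic."""
    def act(self, state):
        raise NotImplementedError  # return an action for this state
    def update_off_policy(self, transitions):
        raise NotImplementedError  # one gradient step on off-policy data

def bop_episode(env, heads):
    # A single head controls the whole episode (Thompson-style deep
    # exploration: behaviour stays temporally consistent within an episode).
    behaviour = random.choice(heads)
    transitions, state, done = [], env.reset(), False
    while not done:
        action = behaviour.act(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    # All heads train on the same trajectory; for every head except the
    # behaviour head the data is off-policy, which gives each head a
    # distinct learning signal and diversifies the ensemble.
    for head in heads:
        head.update_off_policy(transitions)
```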


References
Barth-Maron, G., et al.: Distributed distributional deterministic policy gradients. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)
Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)
Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 449–458 (2017)
Chen, R.Y., Sidor, S., Abbeel, P., Schulman, J.: UCB exploration via Q-ensembles (2017)
Choi, Y., Lee, K., Oh, S.: Distributional deep reinforcement learning with a mixture of Gaussians. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 9791–9797 (2019)
Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for distributional reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1096–1105 (2018a)
Dabney, W., Rowland, M., Bellemare, M.G., Munos, R.: Distributional reinforcement learning with quantile regression. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018b)
Doan, T., Mazoure, B., Lyle, C.: GAN Q-learning (2018)
Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. In: Proceedings of the 5th International Conference on Learning Representations (ICLR) (2017)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, Boca Raton (1994)
Espeholt, L., et al.: IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1407–1416, Stockholmsmässan, Stockholm (2018)
Freirich, D., Shimkin, T., Meir, R., Tamar, A.: Distributional multivariate policy evaluation and exploration with the Bellman GAN. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, vol. 97, pp. 1983–1992 (2019)
Kuznetsov, A., Shvechikov, P., Grishin, A., Vetrov, D.: Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Li, L., Faisal, A.: Bayesian distributional policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 1, pp. 8429–8437 (2021)
Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., Fox, D.: GPU-accelerated robotic simulation for distributed reinforcement learning. In: Conference on Robot Learning, pp. 270–282. PMLR (2018)
Lyle, C., Bellemare, M.G., Castro, P.S.: A comparative analysis of expected and distributional reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511 (2019)
Martin, J., Lyskawinski, M., Li, X., Englot, B.: Stochastically dominant distributional reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning (2020)
Mavrin, B., et al.: Distributional reinforcement learning for efficient exploration. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 4424–4434 (2019)
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48, pp. 1928–1937 (2016)
O'Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty Bellman equation and exploration. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3839–3848, Stockholmsmässan, Stockholm (2018)
Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via bootstrapped DQN. Adv. Neural Inf. Process. Syst. 29, 4026–4034 (2016)
Osband, I., Van Roy, B., Russo, D.J., Wen, Z.: Deep exploration via randomized value functions. J. Mach. Learn. Res. 20, 1–62 (2019)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)
Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the 4th International Conference on Learning Representations (ICLR) (2016)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Singh, R., Lee, K., Chen, Y.: Sample-based distributional policy gradient (2020)
Sutton, R.S.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 12, 1057–1063 (1999)
Tang, Y., Agrawal, S.: Exploration by distributional reinforcement learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2710–2716 (2020)
Wiering, M.A., van Hasselt, H.P.: Ensemble algorithms in reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B 38(4), 930–936 (2008)
Zhang, Z., Chen, J., Chen, Z., Li, W.: Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Trans. Cybern. 51, 604–613 (2019)
Metadata
Title
Bag of Policies for Distributional Deep Exploration
Authors
Asen Nachkov
Luchen Li
Giulia Luise
Filippo Valdettaro
A. Aldo Faisal
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-57963-9_3
