
2024 | Original Paper | Book Chapter

Towards Offline Reinforcement Learning with Pessimistic Value Priors

Authors: Filippo Valdettaro, A. Aldo Faisal

Published in: Epistemic Uncertainty in Artificial Intelligence

Publisher: Springer Nature Switzerland


Abstract

Offline reinforcement learning (RL) seeks to train agents in sequential decision-making tasks using only previously collected data, without direct interaction with the environment. As the agent tries to improve on the policy present in the dataset, it can introduce distributional shift between the training data and its proposed policy, which can lead to poor performance. To prevent the agent from assigning high values to out-of-distribution actions, successful offline RL requires some form of conservatism to be introduced. Here we present a model-free inference framework that encodes this conservatism in the prior belief over the value function: by carrying out policy evaluation with a pessimistic prior, we ensure that only actions directly supported by the offline dataset are modelled as having high value. In contrast to other methods, we do not need to introduce heuristic policy constraints, value regularisation or uncertainty penalties to obtain successful offline RL policies in a toy environment. An additional consequence of our work is a principled quantification of Bayesian uncertainty in off-policy returns in model-free RL. While we present an implementation of this framework and verify its behaviour in the exact-inference setting with Gaussian processes on a toy problem, we identify the scalability issues it suffers from as the central avenue for further work. We discuss these limitations in more detail and consider future directions to improve the scalability of this framework beyond the vanilla Gaussian process implementation, proposing a path towards improving offline RL algorithms in a principled way.
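
To make the role of a pessimistic value prior concrete, the following Python sketch illustrates the core idea in the exact-inference Gaussian-process setting: value estimates for state-action pairs far from the offline data revert to a low prior mean rather than an optimistic one. This is a minimal illustration, not the authors' implementation or their toy environment; the kernel, lengthscale, noise level and the prior mean m0 are illustrative assumptions.

import numpy as np

def rbf_kernel(A, B, lengthscale=0.5, variance=1.0):
    # Squared-exponential kernel between rows of A and rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_query, m0=-10.0, noise=1e-2):
    # GP posterior mean and variance with a constant pessimistic prior mean m0
    # (illustrative value): queries with little kernel support revert to m0.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_query, X_train)
    K_ss = rbf_kernel(X_query, X_query)
    alpha = np.linalg.solve(K, y_train - m0)          # work in prior-mean-centred space
    mean = m0 + K_s @ alpha                           # off-data queries fall back to m0
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Toy illustration: returns are observed only near x = 0; a query far from the
# data is assigned the pessimistic prior value instead of an optimistic one.
X = np.array([[0.0], [0.1], [0.2]])
y = np.array([1.0, 1.2, 0.9])                         # in-distribution observed returns
Xq = np.array([[0.05], [2.0]])                        # in- vs out-of-distribution queries
mu, var = gp_posterior(X, y, Xq)
print(mu)   # first entry close to the observed returns (~1); second close to m0 = -10

In the exact-inference setting this fallback behaviour is what keeps out-of-distribution actions from being modelled as high-value during policy evaluation; a full implementation would additionally account for the temporal-difference structure of the returns, which this sketch omits.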

Metadata
Title
Towards Offline Reinforcement Learning with Pessimistic Value Priors
Authors
Filippo Valdettaro
A. Aldo Faisal
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-57963-9_7
