SciPost Submission Page

The Quantum Cartpole: A benchmark environment for non-linear reinforcement learning

by Kai Meinerz, Simon Trebst, Mark Rudner, Evert van Nieuwenburg

Submission summary

Authors (as registered SciPost users): Kai Meinerz
Submission information
Preprint Link:  (pdf)
Date accepted: 2024-04-15
Date submitted: 2024-03-18 12:58
Submitted by: Meinerz, Kai
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
  • Quantum Physics


Feedback-based control is the de facto standard for controlling classical stochastic systems and processes. However, standard feedback-based control methods are challenged by quantum systems due to measurement-induced backaction and partial observability. Here we remedy this by using weak quantum measurements and model-free reinforcement learning agents to perform quantum control. By comparing control algorithms with and without state estimators when stabilizing a quantum particle in an unstable state near a local potential-energy maximum, we show how a trade-off between state estimation and controllability arises. In the scenario where the classical analogue is highly non-linear, the reinforcement-learned controller has an advantage over the standard controller. Additionally, we demonstrate the feasibility of using transfer learning to develop a quantum control agent trained via reinforcement learning on a classical surrogate of the quantum control problem. Finally, we present results showing how the reinforcement-learning control strategy differs from the classical controller in the non-linear scenarios.

Author comments upon resubmission

We extend our sincere thanks to both referees for their thorough review of our paper and for providing valuable feedback. Here we respond to the referees' comments and list the changes made.

Referee 1:
Comment: The authors do not explain the criteria they employ for assessing the performance. This concerns Fig. 3, 4, and 6 (Here t_termination, not defined, appears on another unspecified scale, different from the one of Fig. 3 and 4). I could not find any explanation of the white line in Fig. 4.

Reply: We have added a detailed description of t_termination on page 5 and an explanation of the white line to the caption of Fig. 4. For Fig. 6 we have chosen a relative scale that weighs the different controllers against the LQGC, highlighting their comparative performance independently of the maximum performance, which depends on the underlying potential. We have added an explanation of this choice of scale on p. 8.
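To make the relative scale concrete, here is a minimal sketch under our own assumptions (the function name and the use of termination time as the performance measure are illustrative, not taken from the paper's code): each controller's score is normalized by the LQGC's score in the same potential.

```python
# Hypothetical sketch of the relative scale used for Fig. 6:
# each controller's termination time is divided by that of the
# LQGC in the same potential, so a value of 1.0 means
# "as good as the LQGC", independent of the absolute
# performance ceiling set by the underlying potential.
def relative_performance(t_controller: float, t_lqgc: float) -> float:
    return t_controller / t_lqgc

# A controller matching the LQGC scores 1.0 in every potential,
# even if the absolute termination times differ between potentials.
assert relative_performance(200.0, 200.0) == relative_performance(50.0, 50.0) == 1.0
```

This normalization lets curves from potentials with very different absolute difficulty be plotted on one comparable axis.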

Referee 2:
Comment: At the beginning of page 4 they use dt and Δt and it is not clear if it is a typo or if they are different. Related to this, it is not clear if, when changing Nmeas, the time between each measurement is proportionally varied (and thus the time between each control is constant) or if it is kept constant (and thus the time between each control changes)

Reply: We have fixed the typo on page 4 and changed dt to Δt. We have also extended the explanation of N_meas, clarifying that Δt remains constant even when N_meas is increased.
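The timing convention can be sketched as follows (a minimal illustration with assumed names, not the authors' code; Δt is written as dt_meas): the interval between successive weak measurements stays fixed, so the interval between control actions grows linearly with N_meas.

```python
# Sketch (assumed names): dt_meas is the fixed interval Δt between
# successive weak measurements; one control action is applied after
# every n_meas measurements, so the time between control actions
# scales linearly with n_meas while Δt itself is unchanged.
def control_interval(n_meas: int, dt_meas: float = 0.01) -> float:
    return n_meas * dt_meas

# Doubling the number of measurements per control step doubles the
# time between control actions.
assert control_interval(2) == 2 * control_interval(1)
```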

Comment: It is not clear how the estimator agent is then used in the control agent. This detail is very important in order to have a fair comparison between the performance of the control agents with and without state estimators. In fact, it is important to clarify that the "control agent + estimator agent" do not have access to more information about the system than the "control agent" alone. Otherwise the improvement in performance is obvious.

Reply: The difference in information between the RLC and the RLE is that the RLE additionally uses the previously estimated state and the control force applied. This is the same difference in information as between the LQR and the Kalman filter. The resulting increase in performance shown in Fig. 3b) is therefore to be expected for N_meas = 1. For N_meas > 1, the framestacking of the measurements must also be taken into consideration. The figure therefore also highlights the interplay between the simultaneous use of estimators and framestacking, and shows that framestacking combined with a simple controller could already serve as a simple quantum control technique. Additionally, Fig. 3 shows that learning the estimator model is more likely to be the bottleneck, as the RLC can match or exceed the performance of the LQR. We have expanded the interpretation of Fig. 3 on p. 7.
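As an illustration of what framestacking means here, the sketch below (our own assumptions on names and sizes, not the authors' implementation) shows a controller being fed the last k weak-measurement outcomes instead of only the most recent one, which partially compensates for the absence of an explicit state estimator.

```python
# Hypothetical framestacking sketch: keep a rolling window of the
# last k weak-measurement outcomes and present their concatenation
# as the controller's observation.
from collections import deque
import numpy as np

class FrameStacker:
    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest outcome is dropped automatically

    def reset(self, first_measurement: np.ndarray) -> np.ndarray:
        # Pad the history with the first outcome so the observation
        # has a fixed size from the very first control step.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_measurement)
        return np.concatenate(self.frames)

    def step(self, measurement: np.ndarray) -> np.ndarray:
        # Append the newest outcome; deque(maxlen=k) drops the oldest.
        self.frames.append(measurement)
        return np.concatenate(self.frames)

stacker = FrameStacker(k=4)
obs = stacker.reset(np.array([0.0]))
obs = stacker.step(np.array([0.3]))
# obs now holds the last 4 outcomes: [0.0, 0.0, 0.0, 0.3]
```

The design point is that the measurement history itself carries the information an estimator would otherwise distill, so a simple controller on stacked frames can already act as a basic quantum control technique.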

List of changes

1. We have added the meaning of the white line in Fig. 4 to the caption of the figure.
2. In the middle of page 8, we have added an explanation of the scale chosen for Fig. 6.
3. Changed a typo from dt to Δt at the beginning of page 4.
4. In the middle of page 4, we have expanded the description of the quantum cartpole setup to clarify that Δt remains constant.
5. We have extended the interpretation of the results of Fig. 3 on page 7.

Published as SciPost Phys. Core 7, 026 (2024)

Reports on this Submission

Anonymous Report 1 on 2024-4-11 (Invited Report)


The authors have replied satisfactorily to my comments and I recommend publication of the manuscript in the present form in SciPost Physics Core.


Publish (easily meets expectations and criteria for this Journal; among top 50%)

  • validity: high
  • significance: good
  • originality: good
  • clarity: high
  • formatting: excellent
  • grammar: excellent
