The Quantum Cartpole: A benchmark environment for non-linear reinforcement learning

Feedback-based control is the de-facto standard when it comes to controlling classical stochastic systems and processes. However, standard feedback-based control methods are challenged by quantum systems due to measurement induced backaction and partial observability. Here we remedy this by using weak quantum measurements and model-free reinforcement learning agents to perform quantum control. By comparing control algorithms with and without state estimators to stabilize a quantum particle in an unstable state near a local potential energy maximum, we show how a trade-off between state estimation and controllability arises. For the scenario where the classical analogue is highly nonlinear, the reinforcement learned controller has an advantage over the standard controller. Additionally, we demonstrate the feasibility of using transfer learning to develop a quantum control agent trained via reinforcement learning on a classical surrogate of the quantum control problem. Finally, we present results showing how the reinforcement learning control strategy differs from the classical controller in the non-linear scenarios.


Introduction
Feedback-based control is essential in many different industries and domains. Stabilizing temperatures, chemical reactions, robotic systems, and even biomedical devices is made possible by continuously adjusting the system inputs based on real-time feedback. The application of this type of control to quantum systems has not yet reached this level of maturity, though several other quantum optimal control methods, such as GRAPE and CRAB [1][2][3], have gained more widespread adoption. A key issue with feedback-based control for quantum systems is that measurements of the quantum system cause measurement back-action [4][5][6], limiting the amount of information that can be obtained. In many cases this renders standard optimal control techniques inapplicable or infeasible, either because they require a model of the quantum system or because they need gradients that require many measurements to estimate.
In this work we discuss a simple quantum problem for benchmarking feedback-based control, building on the quantum cartpole [7]. This control problem is based on the classical cartpole problem, which has become the de-facto standard benchmark for reinforcement learning controllers. We adapt the quantum cartpole problem to include explicit weak measurements of position and momentum, and use a continuous control parameter for feedback. In addition, we introduce a classical surrogate for this quantum problem by mimicking the measurement backaction on the system and the uncertainty in the measurements via noise. For this classical model, we investigate a standard optimal control algorithm, linear quadratic Gaussian control (LQGC), and show that this same controller can also control the quantum system based on weak measurement inputs. In regimes where the standard optimal controller struggles, e.g., for more non-linear systems or in scenarios where a noise characterization is infeasible, we demonstrate that deep reinforcement learning [8] remains a valid option for achieving control.
Deep reinforcement learning (RL) has been demonstrated to be a useful tool for control in several previous works [9][10][11][12][13][14], starting with [15]. It provides a general approach to devising control strategies in cases where a model of the system's dynamics is incomplete, or where other properties such as the noise model are unknown.
This work hence fits into the more general context of machine learning applications to quantum systems [16]. Some of those have used RL for feedback-based control, such as [7,17,18]. An interesting recent work has also explored the use of weak nonlinear measurements as a way to compensate for purely linear controllers [19].

The quantum cartpole
The quantum system we consider in this work is derived from the well-known classical cartpole problem [20]. In it, a cart rolls on a flat one-dimensional track, and a force must be applied in either direction in order to keep a pole, hinged to the middle of the cart, upright. At every timestep, the system is described by a vector $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$ containing the instantaneous position $x_t$ and pole angle $\theta_t$ and their time derivatives. When the angle $\theta$ exceeds a threshold angle $\theta_{\mathrm{th}}$, the failure condition is met and the controller has failed to stabilize the system. The time step at which this failure condition is met is labelled the termination time $t_{\mathrm{termination}}$.
A related but slightly simpler control problem is that of a particle sliding off a hill and needing to be pushed back up, for which the state vector is simply $s_t = (x_t, \dot{x}_t)$ and for which the failure condition is met when the particle slides beyond a certain distance $x_{\mathrm{th}}$, i.e., $|x| > |x_{\mathrm{th}}|$. Although this system is an inverted pendulum, the quantum version of this problem has been dubbed the quantum cartpole [4,7], in which a free particle [initialized as a Gaussian wavepacket $\psi(x)$] is centered on an inverted potential $V(x)$ (for which we will discuss several choices later). The system undergoes unitary dynamics governed by the Hamiltonian

$$H = \frac{p^2}{2m} + V(x), \tag{1}$$

and the failure condition is now set by at least 50% of the wavepacket's probability density extending beyond $x_{\mathrm{th}}$. The state vector of this system is the wavefunction $\psi(x)$, though a controller will not have access to it. Instead, the controller has access to the results of weak measurements on the system, based on which it must decide to apply a particular unitary 'kick' $u_F$ to the system. This control force is realized through a momentum shift operator, i.e., $u_F = e^{-iFx}$. Differently from [7], we allow the controllers to choose a continuous $F$ (with bounded strength $|F| < |F_{\max}|$) and we provide the controller with data from repeated discrete weak measurements of position and momentum (rather than continuous measurements of position only).
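The kick and the failure condition described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the grid, the threshold `x_th = 5.0`, and all function names are assumptions chosen for illustration, and it uses $\hbar = 1$ with $p = -i\,\partial_x$. Under the sign convention $u_F = e^{-iFx}$ as printed, the kick shifts $\langle p \rangle$ by $-F$.

```python
import numpy as np

# Hypothetical discretization and threshold (not taken from the paper).
x = np.linspace(-20.0, 20.0, 2048, endpoint=False)
dx = x[1] - x[0]
x_th = 5.0
sigma = 1.0  # wavepacket width, as quoted in the text


def gaussian_wavepacket(grid, width, p0=0.0):
    """Normalized Gaussian wavepacket centered at grid = 0 with mean momentum p0."""
    psi = np.exp(-grid**2 / (4.0 * width**2)) * np.exp(1j * p0 * grid)
    return psi / np.sqrt(np.sum(np.abs(psi) ** 2) * dx)


def apply_kick(psi, force):
    """Apply u_F = exp(-i F x), which is diagonal in the position basis."""
    return np.exp(-1j * force * x) * psi


def mean_momentum(psi):
    """Expectation value <p>, evaluated in momentum space via the FFT."""
    p = 2.0 * np.pi * np.fft.fftfreq(x.size, d=dx)
    prob_p = np.abs(np.fft.fft(psi)) ** 2
    return float(np.sum(p * prob_p) / np.sum(prob_p))


def has_failed(psi):
    """True if at least 50% of the probability lies outside [-x_th, x_th]."""
    outside = np.sum(np.abs(psi[np.abs(x) > x_th]) ** 2) * dx
    return bool(outside >= 0.5)
```

For example, `mean_momentum(apply_kick(gaussian_wavepacket(x, sigma, p0=0.1), 0.3))` evaluates to approximately `-0.2` in this sketch.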
The full dynamics of the problem are hence as follows. First, a Gaussian wavepacket is initialized at the top of the inverted potential. The wavepacket has a width set by $\sigma = 1.0$, and has an initial momentum chosen randomly from a zero-mean uniform distribution of width $\sigma_p = \langle p_{\mathrm{init}}^2 \rangle = 0.1$, where all units are set to 1. A more detailed description of the parameters and some comments regarding the units can be found in App. A.1. From then on, in every time step $\Delta t$: (i) the system evolves unitarily under Hamiltonian (1) for a time $\Delta t$; (ii) weak position and momentum measurements are performed, sequentially, and are reported to the controller (more details below); (iii) the unitary operator $u_F$ is applied to the system, with $F$ chosen by the controller.
This loop is the core of the dynamics, and runs until the failure condition is met.
To investigate a trade-off between different timescales of the dynamics and control, we introduce a variable $N_{\mathrm{meas}}$ representing the number of times the dynamics loop runs until the force $F$ can be changed. This is shown schematically in Fig. 2. In each repetition we perform a weak measurement of the system's position and momentum [21], and average the results into $x_{\mathrm{est}}$ and $p_{\mathrm{est}}$, which are both passed to the control algorithm. Notice that this is not identical to performing an $N_{\mathrm{meas}}$ times stronger weak measurement, as between consecutive measurements the wave function evolves in time over the duration $\Delta t$, which is kept constant, independent of $N_{\mathrm{meas}}$.
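The structure of this loop, with the force held fixed over $N_{\mathrm{meas}}$ repetitions and the measurement results averaged before being handed to the controller, can be sketched as follows. This is an illustrative skeleton only: the callables `step`, `measure`, `controller`, and `failed`, as well as the noise-free `ToyParticle` stand-in, are hypothetical and do not reproduce the published environment.

```python
def run_episode(step, measure, controller, failed, n_meas, max_steps=1000):
    """Run the dynamics loop and return the number of time steps until failure.

    step(force)   evolves the system by one time step under the given force
    measure()     returns one (x, p) measurement pair
    controller    maps the averaged (x_est, p_est) to a new force
    failed()      returns True once the failure condition is met
    """
    force, t = 0.0, 0
    while t < max_steps:
        xs, ps = [], []
        for _ in range(n_meas):  # the force stays fixed for n_meas time steps
            step(force)
            x_m, p_m = measure()
            xs.append(x_m)
            ps.append(p_m)
            t += 1
            if failed():
                return t
        # Average the measurement record before passing it to the controller.
        force = controller(sum(xs) / len(xs), sum(ps) / len(ps))
    return t


class ToyParticle:
    """Noise-free classical point mass on an inverted quadratic potential (toy stand-in)."""

    def __init__(self, p0=0.1, dt=0.05, x_th=5.0):
        self.x, self.p, self.dt, self.x_th = 0.0, p0, dt, x_th

    def step(self, force):
        self.p += (self.x + force) * self.dt  # acceleration x'' = x + F
        self.x += self.p * self.dt

    def measure(self):
        return self.x, self.p

    def failed(self):
        return abs(self.x) > self.x_th
```

With the toy particle, a zero controller lets the particle slide off the hill within a few hundred steps, whereas a simple linear feedback such as `lambda x, p: -2.0 * (x + p)` keeps it near the top for the full episode.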
The pre-processing step of averaging the weak measurement results (rather than providing the measurements directly and using, e.g., a controller with memory) is inspired by the frame-stacking technique used in reinforcement learning [22] and has the goal of getting better estimates of the position and momentum, thereby preventing the controllers from applying forces that are too strong or too weak. The weak measurements performed at each step are essential for control. As a specific implementation, one can consider these measurements to be performed by coupling the quantum cartpole system to an ancilla system for a short period, after which the ancilla is projectively measured (see App. A.2 for a detailed derivation). Without the weak position measurement, the wavepacket would continue delocalizing irrespective of the unitary force $u_F$. Finally, we mention that we have implemented this control problem as an OpenAI Gymnasium environment [23], making it suitable for reinforcement learning control.

Classical surrogate of the quantum cartpole
To be able to compare the aspects of controlling this quantum system versus a classical system, we introduce a noisy (stochastic) classical version whose noise properties are tuned to mimic the uncertainty of weak measurements. For the rest of this work, the noisy classical system refers to the following implementation rather than the original classical cartpole. This classical system is a linear stochastic system, describing a classical particle on an inverted potential, given by

$$s_{t+1} = A s_t + B u_t + w_t, \tag{2}$$
$$y_t = C s_t + v_t, \tag{3}$$

where $s_t = (x_t, \dot{x}_t)$ is the state vector introduced above, $y_t$ describes the results of measurements (and hence the inputs to the controller), and $u_t$ is the control vector set by the control algorithm, containing the force applied to the system (playing a role analogous to $u_F$ in the quantum problem). The matrices $A$, $B$, and $C$ describe the dynamics and measurements (see App. A.1), and the vectors $w_t \sim \mathcal{N}(0, \sigma_{\mathrm{dyn}})$ and $v_t \sim \mathcal{N}(0, \sigma_{\mathrm{meas}})$ describe noise on the dynamics and on the measurements, respectively. That is, $w_t$ and $v_t$ are normally distributed random variables with zero mean and standard deviations $\sigma_{\mathrm{dyn}}$ and $\sigma_{\mathrm{meas}}$, respectively. The uncertainty coming from the weak measurements is reflected in $v_t$, and the measurement backaction is reflected through noise on the state, $w_t$. Because the backaction depends on the measurement outcome, the noise models for $v_t$ and $w_t$ are correlated (see App. A.3). Also in this case, the particle (now a point mass) is initialized on top of the inverted potential with a random momentum chosen from a uniform distribution of width 0.1.
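A minimal simulation of such a noisy linear system is sketched below, assuming the standard state-space form $s_{t+1} = A s_t + B u_t + w_t$, $y_t = C s_t + v_t$. The matrices (an Euler discretization of $\ddot{x} = kx$), the values of `dt`, `k`, and `x_th`, and the function name are hypothetical and are not the values of App. A.1. For simplicity the two noise sources are drawn independently here, whereas the text states that they are correlated.

```python
import numpy as np

dt, k = 0.05, 1.0                          # hypothetical time step and potential constant
A = np.array([[1.0, dt], [k * dt, 1.0]])   # Euler step of x'' = k x (inverted quadratic)
B = np.array([[0.0], [dt]])                # the force acts on the momentum only
C = np.eye(2)                              # both position and momentum are measured


def simulate(controller, sigma_dyn, sigma_meas, x_th=5.0, max_steps=2000, rng=None):
    """Return the number of steps until |x| exceeds x_th (or max_steps)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Start on top of the hill with a small random momentum (uniform, width 0.1).
    s = np.array([0.0, rng.uniform(-0.05, 0.05)])
    for t in range(1, max_steps + 1):
        y = C @ s + rng.normal(0.0, sigma_meas, size=2)   # noisy measurement
        u = np.atleast_1d(controller(y))                  # control input
        s = A @ s + B @ u + rng.normal(0.0, sigma_dyn, size=2)
        if abs(s[0]) > x_th:
            return t
    return max_steps
```

Calling `simulate(lambda y: -np.array([[2.0, 2.0]]) @ y, 0.0, 0.0)` applies a hand-picked stabilizing linear feedback, while `simulate(lambda y: np.zeros(1), 0.0, 0.0)` shows the uncontrolled particle leaving the allowed region.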

Control algorithms
Now that we have a quantum problem and a classical problem that mimics it, we turn our attention towards two possible control strategies. In particular, we ask how well an optimal controller for the classical variant performs on the quantum version, and whether a reinforcement learning controller can go beyond such optimal control in scenarios where the latter struggles.

Linear quadratic Gaussian control
The well-known linear quadratic Gaussian control (LQGC) algorithm is a classical control algorithm that is known to optimally control a linear system subjected to Gaussian noise [24,25].
The algorithm itself consists of two parts: the Kalman filter [26,27] (the estimator) and the linear quadratic regulator (LQR) (the controller) [28]. The latter assumes that we can apply linear control, $u_t = -K_{\mathrm{LQR}} x_t$ in Eq. (2), and provides $K_{\mathrm{LQR}}$ by minimizing a quadratic cost function $J$ (see App. A.4.4). The performance of this controller (without the Kalman filter) on the classical problem with a quadratic inverted potential is shown in Fig. 3a for various numbers of measurements and for several values of noise, where the performance is measured in terms of $t_{\mathrm{termination}}$, which is the average number of time steps $\Delta t$ until the termination condition is met. There is an intuitive trade-off: more measurements allow for better control (by averaging out the noise), but too many measurements are detrimental since they take too much time. In that time, the system either reaches the failure condition or goes beyond the point where control is possible (e.g., because a control force $|F| > |F_{\max}|$ would be required).
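As an illustration of how a gain such as $K_{\mathrm{LQR}}$ can be obtained, the sketch below iterates the discrete-time Riccati recursion for a quadratic cost $\sum_t (s_t^T Q s_t + u_t^T R u_t)$. The system matrices and the cost weights `Q` and `R` are hypothetical placeholders; the cost function actually used is specified in App. A.4.4 and is not reproduced here.

```python
import numpy as np

dt, k = 0.05, 1.0                          # hypothetical values
A = np.array([[1.0, dt], [k * dt, 1.0]])   # Euler step of x'' = k x
B = np.array([[0.0], [dt]])
Q = np.eye(2)                              # assumed state cost weights
R = np.array([[1.0]])                      # assumed control cost weight


def lqr_gain(A, B, Q, R, iters=2000):
    """Iterate the discrete Riccati recursion and return the feedback gain K."""
    P = Q.copy()
    for _ in range(iters):
        # K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P = Q + A^T P (A - B K), equivalent to the usual Riccati update
        P = Q + A.T @ P @ (A - B @ K)
    return K


K = lqr_gain(A, B, Q, R)
# The control law is then u_t = -K s_t; all closed-loop eigenvalues of A - B K
# should lie inside the unit circle.
closed_loop_radius = max(abs(np.linalg.eigvals(A - B @ K)))
```

The open-loop matrix `A` has an eigenvalue larger than one (the particle slides off the hill), so checking `closed_loop_radius < 1` is a quick sanity test that the computed gain stabilizes the linear model.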
Better performance can be achieved by incorporating an estimator such as the Kalman filter into the feedback protocol. The Kalman filter provides an estimate $\hat{s}_t$ of the system's state by using a model of the underlying dynamics to calculate the most probable state of the system based on the measurement $y_{t-1}$ and the previous state estimate $\hat{s}_{t-1}$. Appendix A.4 describes in more detail how this is done for a linear system. Figure 3b shows that when we use the Kalman filter, the trade-off disappears entirely and a single measurement provides the best result. This is true for Markovian systems, for which the current state of the system depends only on the previous state (and not on the further history), so that knowing the previous state provides all the information needed for a good estimate of the current state.
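For reference, one step of a textbook Kalman filter for a linear system can be sketched as follows. This is the standard predict/update form with independent process and measurement noise covariances `Q_w` and `R_v`; it is an assumption-laden illustration and not the exact formulation of App. A.4, which also accounts for the correlated noise mentioned earlier and may use a different time-index convention.

```python
import numpy as np


def kalman_step(s_hat, P, u, y, A, B, C, Q_w, R_v):
    """One predict/update cycle: returns the new state estimate and covariance."""
    # Predict: propagate the previous estimate through the model.
    s_pred = A @ s_hat + B @ u
    P_pred = A @ P @ A.T + Q_w
    # Update: correct the prediction with the new measurement y.
    S = C @ P_pred @ C.T + R_v               # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    s_new = s_pred + K @ (y - C @ s_pred)
    P_new = (np.eye(len(s_hat)) - K @ C) @ P_pred
    return s_new, P_new
```

Because the filter carries its own running estimate and covariance, each new measurement refines the estimate rather than replacing it, which is consistent with the observation that a single measurement per control step suffices once an estimator is used.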

Reinforcement learning control
Because we will move away from the scenario where LQGC is designed to work, we turn our attention to reinforcement learned control. A possible advantage of such controllers is that they can learn control without having access to a model and without access to the noise model (i.e., without explicitly making use of Eqs. (3)). Hence, being model free and relying only on measurement results as input, the agents we study can be applied to the control of either quantum or classical systems (though performance and optimal parameters may be different for the two cases). A thorough introduction to reinforcement learning control can be found in [8], and we mention specifically here that reinforcement learning agents are capable of learning the LQR algorithm [29].
Inspired by LQGC, rather than training a single agent to stabilize the quantum cartpole, we train two distinct agents: one for determining the control force (the reinforcement learned controller, RLC) and another responsible for state estimation (the reinforcement learned estimator, RLE). Both agents are trained using a stochastic on-policy training algorithm called the proximal policy optimization (PPO) algorithm [30], and both use continuous input and output spaces. Details of the training process, a short description of the PPO algorithm, and the parameters used are listed in Appendix A.5.
The first reinforcement learning agent, the reinforcement learned controller (RLC), is trained with the goal of providing the control input based on the raw measurement inputs. The input to the agent is hence directly $x_{\mathrm{est}}$ and $p_{\mathrm{est}}$. As output the agent returns a control force $F$ from the range $[-F_{\max}, F_{\max}]$ for the next time step. The reward is $-1$ if the control fails (i.e., the wavefunction moves outside the boundaries), and 0 at every other time step. Testing the RLC on a classical system with an inverted quadratic potential in Fig. 3a, we see that it has the same trade-off as the LQR controller, but performs slightly better owing to the different objectives of the two algorithms. The LQR minimizes the quadratic cost during the run, whereas the RLC aims to avoid the worst case of the wavefunction being pushed beyond the threshold.
The second agent, the reinforcement learned estimator (RLE), is trained with the goal of replacing the Kalman filter. To do so, we provide it with the previous state estimate $\hat{s}_{t-1}$, the mean of the $N_{\mathrm{meas}}$ measurements taken at timestep $t$, and the last control value $u_t$. In contrast to the Kalman filter, however, the agent has no knowledge of the noise covariance or of the system's equations of motion. As output the agent returns the predicted change of the state $\Delta\hat{s}_t$, so that $\hat{s}_{t+1} = \hat{s}_t + \Delta\hat{s}_t$. The goal of the training is to minimize the squared prediction error $e_t^2$, where $e_t = y_t - CA\hat{s}_{t-1}$ (see App. A.4), by providing a reward $r_t = -e_t^2$ in each time step. For the purpose of training, the controller is replaced by a simple random controller (choosing a random force every time step), since the estimation task does not depend on the actual control strategy.
The performance of the estimator in combination with the RLC on the classical surrogate system with an inverted quadratic potential is shown in Fig. 3b, where, as for the LQGC, the performance always reaches its maximum at $N_{\mathrm{meas}} = 1$, independent of the noise level. Overall, the performance of the reinforcement learned estimator is below that of the Kalman filter, indicating that learning the state estimator is more challenging than learning the controller, and making LQGC the better choice if the system is linear and a noise model is available.

Controlling the quantum cartpole
We now turn our attention to the quantum version of the problem. In Fig. 4 we show how a reinforcement learning controller, trained on the quantum system, performs in this scenario (still with a quadratic inverted potential), once without the estimator (panel a) and once with it (panel b). Here, too, the trade-off between more measurements for more information and the resulting latency is apparent if only the controller is used. Panel a also shows that control quickly becomes infeasible in the weak measurement regime (corresponding to the top of the panel), where the trade-off fades out, but it still indicates that a finite number of measurements remains optimal. As in the classical case, adding the estimator shifts the optimum in $N_{\mathrm{meas}}$ to a single measurement, as seen in Fig. 3b.

Potential variations
To comprehensively explore the capabilities of our controllers, we consider four possible combinations of controllers (LQR and the RLC) and estimators (Kalman filter and the RLE). We explore how these combinations perform as a function of the number of measurements, evaluating them based on the average time $t_{\mathrm{termination}}$ (calculated over $10^5$ runs). To check the validity and usefulness of the LQGC versus the reinforcement learning controller, we investigate different potentials that make the system non-linear. The potentials that we study are depicted in Fig. 5, which next to the quadratic inverted potential shows two more: (i) a cosine potential $V(x) = k_1\left(\cos(\pi x / k_2) - 1\right)$, and (ii) a quartic potential. The values for $k_1$, $k_2$ and $k$ are listed in Appendix A.1, which also shows the performance of the controllers on the classical system with these potentials. For the RL-based controllers, we trained multiple agents and averaged the results of the 10 best-performing ones. We present their performance relative to that of the LQGC, which highlights the improvement independently of the absolute performance, since the latter depends on the underlying potential.
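To see in what sense the cosine potential is non-linear, it helps to compare it with its small-$x$ expansion, $V(x) \approx -\tfrac{1}{2} k_1 (\pi/k_2)^2 x^2$, which is an inverted quadratic. The short sketch below does this numerically. The values of `k1` and `k2` are hypothetical (the actual values are listed in App. A.1), and `v_quadratic_approx` is only the Taylor expansion of the cosine potential, not necessarily the quadratic potential used in the benchmarks.

```python
import numpy as np

k1, k2 = 1.0, 10.0  # hypothetical values, for illustration only


def v_cosine(x):
    """Cosine potential V(x) = k1 (cos(pi x / k2) - 1), with its maximum at x = 0."""
    return k1 * (np.cos(np.pi * x / k2) - 1.0)


def force_cosine(x):
    """Force -dV/dx; it points away from x = 0 and is a nonlinear function of x."""
    return k1 * (np.pi / k2) * np.sin(np.pi * x / k2)


k_eff = k1 * (np.pi / k2) ** 2  # curvature of the potential at the top


def v_quadratic_approx(x):
    """Second-order Taylor expansion of the cosine potential around x = 0."""
    return -0.5 * k_eff * x**2
```

Close to the top the two potentials agree, so a linear model is adequate there. Further out (for example at `x = 5.0` with these values) they differ noticeably, which is where a controller designed for the linearized system is no longer guaranteed to be optimal.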
For the quadratic inverted potential, the combination of the reinforcement learning controller (RLC) and the Kalman filter performs as well as the LQGC algorithm (see panel a of Fig. 6). This is a notable result, because the LQGC is not a guaranteed optimal controller in the quantum environment. However, our findings are consistent with those presented for discrete control of the quantum cartpole [7]. At the same time, we note that the reinforcement learned estimator (RLE) struggles to match the performance of the Kalman filter up to about $N_{\mathrm{meas}} \sim 40$, and that it does not remove the trade-off behavior discussed previously. For larger $N_{\mathrm{meas}}$ the RLE seems to be able to match the Kalman performance.
Now going to a nonlinear system, starting with the cosine potential (Fig. 6b), the overall performances are similar to those for the linear system. The combination RLC + Kalman is now able to gain a notable advantage over the LQGC, increasing the performance by $\sim 10\%$ for a single measurement, $N_{\mathrm{meas}} = 1$. Similarly, the controllers involving the RLE were able to close the performance gap to the LQGC by small margins, but remain far behind.
It is for the quartic potential that the advantage of using RL becomes really evident. Here the RLC + Kalman controller is able to achieve an increase in performance of $\sim 60\%$ compared to the LQGC. At the same time, we see that the RLC + RLE and the LQR + RLE controllers are both able to achieve a performance advantage over the LQGC and narrow the gap with the RLC + Kalman controller, indicating that the LQR algorithm is the main bottleneck. On a broader level, we expect that for even more non-linear problems reinforcement learning control indeed becomes the go-to choice instead of LQGC.

Transfer learning
Finally, we consider a transfer learning scenario in which we train a reinforcement learning agent on the classical surrogate, and then apply it to control the quantum system. The results of these comparisons for the different combinations of controllers and estimators are shown in Fig. 6. Interestingly, RL agents trained on the classical system perform almost identically to those trained on the quantum system (comparing the 'transfer' controllers with their counterparts). This suggests that training on a classical surrogate model for controlling the quantum cartpole is indeed a viable strategy.

Controller characteristics
To further elucidate the control strategies used by the control algorithms, we investigate the resulting distributions of position 〈x〉 and momentum 〈p〉 expectation values, shown in Fig. 7.
For this we compare the LQGC and the RLC + RLE controllers on the three different potentials. The distributions were obtained by collecting the position and momentum of the wavefunction over $10^6$ time steps. In order to avoid artifacts from the initialization of the wavefunctions, only data from $t = 300$ onwards were used. Looking first at the LQGC, it can be observed that the distributions for all potentials are symmetric and centered around 0, showing that the controller aims to stabilize the wavefunction at the centers of the potentials. When comparing the quadratic and cosine potentials, it is notable that the cosine potential has a wider distribution, due to the fact that it starts to flatten out near the threshold. It appears that this allows the controller to stabilize the wavefunction closer to the threshold for longer durations. In contrast, the quartic potential has the sharpest distribution, suggesting that the controller is unable to recover the wavefunction when it is close to the thresholds.
Looking at the distributions of the full reinforcement learning control algorithm (RLE + RLC), one notable disparity is observed. The distributions are neither centered around 0 nor are they symmetric. This is attributed to the training process, where the agent can develop a bias for stabilizing the wavefunction at a particular point. This is particularly evident in the quadratic and quartic potentials, where the wavefunction is stabilized left and right of the center.
For the cosine potential, the distribution of the average position with RLC + RLE significantly differs from that of the LQGC. Instead of a clear peak in the distribution, a broader plateau appears, indicating that the controller has learned to balance the wavefunction on the side of the potential rather than the center.

Figure 6: Benchmarks of the various controllers on the quantum system, depending on the number of measurements for the input. We showcase the performance of the Kalman filter + LQR (blue) and the pure reinforcement learning controller (light red), as well as the mix of those controllers with Kalman filter + RLC (red) and LQR + RLE (light blue). Additionally, transferred RLC + Kalman filter and RLC + RLE (black) are presented, which were trained on the classical system and then applied to the quantum system. The performance is shown as the ratio of the average termination time between a selected controller and the Kalman + LQR controller. Each plot represents a different potential, the first being the quadratic potential, followed by the cosine potential and the quartic potential.

Conclusions and Outlook
Standard classical feedback-based control algorithms are challenged by quantum systems, due to their intrinsic measurement-induced backaction and partial observability. However, for quantum systems with observables that can approximately be described by linear stochastic equations, we found that the classical LQGC algorithm performs well if full knowledge of the model as well as its noise characteristics is available. If such knowledge is not available, or if the system is non-linear, the classical controller is found to struggle. In scenarios where knowledge of the model and the noise is unavailable (as is often the case for complex experiments), and/or where the system is non-linear, we showed that a model-free reinforcement learned controller outperforms the LQGC algorithm.
To fairly demonstrate the advantage of our reinforcement learning controller, we constructed a surrogate model in the form of a classical stochastic system, whose properties are designed to closely mimic the measurement-induced backaction of the quantum cartpole. Using this classical surrogate, we demonstrated that transfer learning is feasible by training the RL agent on this classical model and applying it to controlling the true quantum system. This opens up the possibility of more efficient training methods.
Both the LQGC algorithm and our reinforcement learned controller are composed of a separate estimator and controller, and we showed that without the estimator a trade-off between the number of measurements and the controllability exists. Including the estimators removes this trade-off, allowing for optimized control with just a single (weak) measurement. For the LQGC algorithm, estimating the state (with the Kalman filter) again requires knowledge of the model and the noise characteristics. The reinforcement learned estimator struggles to match the performance of the Kalman filter, but still ensures controllability with single measurements. Frame-stacking techniques improve upon this further.
Finally, we found that when analyzing the control strategies for the linear system, the reinforcement learning controller learns a strategy similar to the optimal strategy of the LQGC. For the non-linear case, the reinforcement learning controller manages to outperform the LQGC algorithm by stabilizing the system with a different control strategy that allows for a broader distribution of the system's momentum.
It would be an interesting future direction to focus on making the RL agent more autonomous, by having it choose when and how strongly to perform the weak measurements. This would result in an adaptive algorithm that may learn to measure only when necessary. Similarly, instead of frame-stacking, the agent could learn to use the raw weak measurement outputs rather than only their average. Finally, future work could focus on extending the system in interesting directions such as time-varying potentials, non-Markovian noise, or interacting systems.

Data availability
The reinforcement learning environment [31] (in the form of an OpenAI Gymnasium) as well as the code and all configuration data used for obtaining the benchmarks in this paper [32], are made available.

A.1 System parameters
For the simulations, in order to make a fair comparison between the different potentials, we have used the same initialization parameters for all three potentials, with the exception of the potential constant $k$.

A.2 Weak measurement
We want to perform the weak measurement on the quantum state $|\Psi\rangle = |\psi\rangle \otimes |\phi\rangle$, where $|\psi\rangle$ is the system state and $|\phi\rangle$ the ancilla state. The two systems interact via the Hamiltonian $H_{\mathrm{int}} = A \otimes p$. We assume that the interaction time $\delta t$ is small enough that the time evolution is dominated by the weak measurement. The time evolution can be written as

$$|\Psi'\rangle = e^{-i\lambda A \otimes p}\,|\Psi\rangle,$$

with $\lambda = s\,\delta t$, where $s$ is the interaction strength and $\delta t$ the interaction time. Here $A$ is a Hermitian operator with eigenstates $|\alpha\rangle$, acting on the system state $|\psi\rangle$, and $p$ is the momentum operator acting on the ancilla state $|\phi\rangle$.
Choosing the form $|\phi\rangle = \frac{1}{(2\pi\sigma^2)^{1/4}} \int dq'\, \exp\left[-q'^2/(4\sigma^2)\right] |q'\rangle$ for the ancilla state, we perform a projective measurement in the $q$ basis using the projector $I \otimes |q\rangle\langle q|$. This leaves the state in the form $|\Psi_{q_m}\rangle = M_{q_m}|\Psi'\rangle / N$, and returns the measured quantity $q_m$ and the Kraus operator $M_{q_m}$, which is a Gaussian-weighted sum of projectors onto the eigenstates of $A$.
In the case of a nonlinear system, we approximate the dynamics according to Eq. A.29 and set the weights such that $s_t^T W_1 s_t = H|_{s_t}$, where we evaluate $W_{\mathrm{cosine}}$ and $W_{\mathrm{quart}}$ at every timestep, according to the latest state estimate $\hat{s}_t$.
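The weak measurement of App. A.2 can be illustrated for the case $A = x$ on a discretized wavefunction. The sketch below assumes the standard Gaussian form of the Kraus operator, $M_{q_m} \propto \sum_\alpha \exp[-(q_m - \lambda\alpha)^2/(4\sigma^2)]\,|\alpha\rangle\langle\alpha|$, which is consistent with the description above but is not copied from the paper. The function name, grid, and parameter names are hypothetical.

```python
import numpy as np


def weak_position_measurement(psi, x, sigma_anc, lam=1.0, rng=None):
    """Sample a weak-measurement outcome q_m and return (q_m, post-measurement state).

    psi        wavefunction sampled on the uniform grid x
    sigma_anc  width of the Gaussian ancilla state (larger = weaker measurement)
    lam        coupling lambda = s * delta_t
    """
    if rng is None:
        rng = np.random.default_rng()
    dx = x[1] - x[0]
    prob_x = np.abs(psi) ** 2 * dx
    prob_x = prob_x / prob_x.sum()
    # The outcome distribution is |psi|^2 (rescaled by lam) convolved with the
    # ancilla Gaussian, so sample a position first and then add ancilla noise.
    x0 = rng.choice(x, p=prob_x)
    q_m = lam * x0 + rng.normal(0.0, sigma_anc)
    # Backaction: multiply by the Gaussian weight on each position eigenstate.
    weight = np.exp(-((q_m - lam * x) ** 2) / (4.0 * sigma_anc**2))
    psi_post = weight * psi
    psi_post = psi_post / np.sqrt(np.sum(np.abs(psi_post) ** 2) * dx)
    return q_m, psi_post
```

For a Gaussian wavepacket with position variance $\sigma_\psi^2$ and $\lambda = 1$, the post-measurement variance in this sketch is $\sigma_\psi^2 \sigma^2 / (\sigma_\psi^2 + \sigma^2)$ regardless of the outcome, which shows how the measurement counteracts the delocalization of the wavepacket.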

A.5 Reinforcement learning training
The PPO algorithm trains a policy $\pi_\theta$ from which the actions $a$ are sampled, allowing for exploration during training. The policy is updated by maximizing the objective function $L$. Since reinforcement learning is known to be vulnerable to performance collapse [34,35], caused by a few unfortunate episodes in the training, PPO limits how far any new policy is allowed to differ from the previous one by clipping the probability ratio $\pi_\theta(a, s)/\pi_{\theta_{\mathrm{old}}}(a, s)$ to the interval $[1-\varepsilon, 1+\varepsilon]$, with $\varepsilon$ controlling the clipping range. The new policy therefore does not benefit from moving far away from the old policy.
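The clipping mechanism can be made concrete with the standard PPO clipped surrogate objective, $L = \mathbb{E}\left[\min\left(r\,\hat{A},\ \mathrm{clip}(r, 1-\varepsilon, 1+\varepsilon)\,\hat{A}\right)\right]$, where $r$ is the probability ratio between the new and the old policy and $\hat{A}$ is an advantage estimate. The sketch below is a generic illustration of that formula; the function name and the default `eps=0.2` are assumptions and not the hyperparameters of Tab. 2.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, averaged over a batch of samples.

    logp_new, logp_old  log-probabilities of the taken actions under the new
                        and the old policy
    advantages          advantage estimates for the same samples
    eps                 clipping range
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum removes any incentive to push the ratio outside
    # [1 - eps, 1 + eps] in the direction that would increase the objective.
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```

For a positive advantage, a ratio of 2 contributes only `1.2` per unit of advantage, so the objective saturates. For a negative advantage the unclipped term is kept, so moving towards a bad action is still fully penalized.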
In the training of the reinforcement learning models, we used two different sets of hyperparameters: one for the training of the RLC models and one for the RLE. The training of the RLE and RLC agents uses the same underlying methods; only the hyperparameters differ, which were determined using grid search and are listed in Tab. 2. During the training we tracked the returned reward after each epoch and saved the models that returned the highest reward.
The training of the RLC agent turned out to be prone to getting stuck in local minima, completely halting the learning process. Since this happened more often at small numbers of weak measurements, we utilized transfer learning [36] to circumvent this problem. We first trained 48 agents on $N_{\mathrm{meas}} = 48$ weak measurements. These agents were then used as the starting point for training with $N_{\mathrm{meas}} = 47$ weak measurements. This was repeated for each number of weak measurements until $N_{\mathrm{meas}} = 1$ was reached. The trained agents were then evaluated on the potential, as shown in Fig. 8a for the Kalman filter + RLC controller on a classical system with an inverted quartic potential. At the starting point of the training, at $N_{\mathrm{meas}} = 48$, the agents all show a performance close to one another, but as the transfer learning continues, at around $N_{\mathrm{meas}} = 20$, the performance of the agents starts to diverge. The majority of the agents exhibit increasing performance with the number of measurements, while a small number of agents get stuck during the training.
Because the training of the RLE does not directly depend on the average termination time of the wavefunction, we do not need to make use of transfer learning and can train the 48 agents directly and independently of each other, one agent for every number of measurements we perform. In Fig. 8b the training process of the RLE agent for $N_{\mathrm{meas}} = 1$ in the classical system with an inverted quartic potential is shown, demonstrating fast convergence of the training.
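The descending training schedule used for the RLC can be summarized in a few lines. The sketch below follows a single chain of agents from $N_{\mathrm{meas}} = 48$ down to $N_{\mathrm{meas}} = 1$, whereas the text describes 48 such agents. The function `train` is a hypothetical placeholder for one PPO training run that starts from a given agent, and is not part of any specific library.

```python
def train_descending(train, initial_agent, n_start=48):
    """Train at n_start measurements first, then reuse each result as the
    starting point for the next smaller number of measurements.

    train(agent, n_meas) must return the agent obtained after training at n_meas.
    Returns a dict mapping n_meas to the agent trained for that setting.
    """
    agents = {}
    agent = initial_agent
    for n_meas in range(n_start, 0, -1):
        agent = train(agent, n_meas)  # warm start from the previous setting
        agents[n_meas] = agent
    return agents
```

The design choice here is a curriculum: the agent first learns in the regime where training is reported to be more reliable (many measurements) and is then gradually moved to the harder single-measurement regime.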

A.6 Extended classical benchmarks
In Fig. 9 the performance obtained when adding an estimator to the RLC and LQR is shown. Compared to Fig. 3, the noise level is fixed at $\sigma_{\mathrm{meas}} = 0.7$, resulting in higher overall performance for all controllers. Furthermore, the performance gap has widened while using an estimator, with RLC

Figure 1: The quantum cartpole setup. A wavepacket $\psi(x)$ is placed on top of an inverted potential $V(x)$, and a controller must ensure that at least 50% of $|\psi(x)|^2$ stays within the interval $[-x_{\mathrm{th}}, x_{\mathrm{th}}]$. As input, the control algorithm receives weak measurement outcomes, and as output it sets the strength of a unitary control force that is applied to the wavepacket.

Figure 2: Scheme of the dynamics of the quantum cartpole problem. In every time step $\Delta t$ a force $F$ is applied, the wavefunction is evolved in time, and weak position and momentum measurements are performed. These steps are repeated $N_{\mathrm{meas}}$ times. Afterwards, the mean values of the weak measurements $x_{\mathrm{est}}$ and $p_{\mathrm{est}}$ are passed to a control algorithm, which will decide on a value for the stabilizing force $F$.

Figure 3: Controller performance for the noisy classical cartpole (with inverted quadratic potential). Panel a) compares the RLC against the LQR for different levels of measurement noise $\sigma_{\mathrm{meas}}$ (see Eqs. 3), showing a comparable increase and decrease in performance with increasing number of measurements. The RLC reaches a higher maximum than the LQR. In b) the noise level is fixed at $\sigma_{\mathrm{meas}} = 0.8$ and state estimators in the form of the Kalman filter and RLE are added to the comparison, shifting the peak performance to a single measurement.

Figure 4: Controller performance for the quantum cartpole with an inverted quadratic potential. Shown is the performance of various controllers (measured by the termination time $t_{\mathrm{termination}}$), indicating a trade-off between the number of measurements $N_{\mathrm{meas}}$ and the strength of the measurement, given by the width $\sigma_{\mathrm{ancilla}}$ of the measurement wavefunction (see Appendix A.2 for details). Larger $\sigma_{\mathrm{ancilla}}$ corresponds to weaker measurements. Panels a) and b) show RLC and RLC + RLE as controllers, respectively, with the heat maps showing the average termination time $t_{\mathrm{termination}}$ on a logarithmic scale and the white line indicating $t_{\mathrm{termination}} = 10^3$.

Figure 5: The different potentials used for the quantum cartpole problem. The cosine and quartic (red and green) potentials are used to demonstrate behavior on nonlinear systems, while the quadratic (blue) potential is used for a linear system.

Figure 7: Distributions of the position and momentum of the stabilized wavepacket on the quantum system for the three different potentials: (I) quadratic, (II) cosine, (III) quartic. The position $\langle x \rangle$ of the wavefunction was tracked over $10^6$ timesteps for the LQR + Kalman controller (blue) as well as the RLC + RLE controller (red) and converted to a histogram of the position probability $p(\langle x \rangle)$. The same is shown for the momentum $\langle p \rangle$ in the top right insets.

Figure 8: Visualization of the results of training agents on the classical system with inverse quartic potential. In a) the average termination time of all agents after the training is shown against the number of measurements performed. For every number of measurements, 48 different agents are shown, color-coded by the rank of their performance. In b) the process of training an RLE agent is showcased, where the reward is plotted over the first 300 training epochs.

Figure 9: Comparison of RL vs LQR & Kalman filter on the classical cartpole (with inverted quadratic potential). The noise level is fixed at $\sigma_{\mathrm{meas}} = 0.5$ and state estimators in the form of the Kalman filter and RLE are added to the comparison.

Figure 10: Benchmarks of the various controllers on the classical system as a function of the number of measurements used as input. We showcase the performance of the classical controller using Kalman filter + LQR (blue) and the pure reinforcement learning controller (light red), as well as the mixed classical and reinforcement learning controllers, Kalman filter + RLC (red) and RLE + LQR (light blue). The performance is shown as the ratio of the average termination time of the selected controller to that of the Kalman + LQR controller. Each plot represents a different potential: first the quadratic potential, followed by the cosine potential and the quartic potential.

Figure 11: Distributions of the position and momentum of the stabilized wavepacket on the classical system. The position $\langle x \rangle$ of the wavefunction was tracked over $10^6$ timesteps for the LQR + Kalman controller (blue) as well as the RLC + RLE controller (red) and converted to a histogram of the position probability $p(\langle x \rangle)$. The same is shown for the momentum $\langle p \rangle$ in the top right insets. The three plots represent measurements from different potentials: quadratic (I), cosine (II), and quartic (III).

Table 1: Parametrization of the environment used for the benchmarks, including the parameters of the potentials, the time evolution, and the weak measurements. All units are set to 1.

Table 2: Hyperparameters of the reinforcement learning approach, specifying the initialization of the RLC and RLE agents and their respective training process.