Fast Tensor Disentangling Algorithm

Many recent tensor network algorithms apply unitary operators to parts of a tensor network in order to reduce entanglement. However, the iterative algorithms previously used to minimize entanglement can be slow. We introduce an approximate, fast, and simple algorithm to optimize disentangling unitary tensors. Our algorithm is asymptotically faster than previous iterative algorithms and often results in a residual entanglement entropy that is within 10 to 40% of the minimum. For certain input tensors, our algorithm returns an optimal solution. When disentangling order-4 tensors with equal bond dimensions, our algorithm achieves an entanglement spectrum where nearly half of the singular values are zero. We further validate our algorithm by showing that it can efficiently disentangle random 1D states of qubits.

Another approach (used in Refs. [14,17]) is to iteratively minimize the entanglement entropy [defined later in Eq. (10)] across part of the tensor network, as depicted in Fig. 1. However, these iterative methods are also very CPU intensive. An iterative first-order gradient descent algorithm can converge very slowly, especially when narrow valleys are present in the entanglement entropy cost function (see footnote 1). Convergence is even more challenging for Renyi entropies S_α with α ≤ 1/2 due to |λ|^{2α} singularities at small singular values λ, which can even prevent convergence to a local minimum in the limit of infinitesimal step size for many algorithms. Applying a second-order Newton method can require fewer iterations. However, for large bond dimensions χ, the CPU time and memory requirements grow rapidly as O(χ^12) and O(χ^8), respectively, with e.g. χ = 16 requiring roughly 40 CPU core-hours and 34 GB of RAM just to diagonalize and store the Hessian for a single iteration.
The CPU time of our algorithm scales as O(χ_1^3 χ_3^2 + χ_1^6) when χ_1 = χ_2 and χ_3 = χ_4 (see footnote 4). This CPU complexity is as fast or faster (and strictly faster when χ_3 > χ_1) than the complexity O(χ_1^3 χ_3^3) for computing the singular values of A across the dotted red line, which are needed to calculate the entanglement entropy across that line. This makes our algorithm asymptotically faster than even a single step of any iterative algorithm that attempts to minimize the entanglement entropy. For χ_1 = χ_2 = χ_3 = χ_4 = 16, our algorithm requires only about 10 ms of CPU time.

Footnote 1: Increasingly narrow valleys occur for the tensors in the later rows of Tab. 1.
Footnote 2: In Appendix B, we generalize the algorithm to be applicable when χ_1 > χ_3 or χ_2 > χ_4.
Footnote 3: Here, unitary means that ∑_k U_{ij,k} U*_{i'j',k} = δ_{ii'} δ_{jj'} and ∑_{ij} U_{ij,k} U*_{ij,k'} = δ_{kk'}.
Footnote 4: See Appendix A for more complexity details.
We describe our algorithm in Sec. 2, and then benchmark it against iterative optimization of the entanglement entropy in Sec. 3. In Sec. 4, we show that our algorithm is capable of efficiently disentangling random initial states.

Algorithm and Intuition
Algorithm 1: Fast tensor disentangling algorithm [25]
Input: tensor A_{k,ab}
(1) r_k ← random vector of length χ_1 χ_2
(2) α^(3)_a, α^(4)_b ← dominant left and right singular vectors of (r · A)_{ab}
(3) V^(3) ← χ_3 × χ_1 isometry from a thin SVD, keeping the largest χ_1 singular values
(4) V^(4) ← χ_4 × χ_2 isometry from a thin SVD, keeping the largest χ_2 singular values
(5) B_{k,ij} ← A contracted with V^(3) and V^(4)
(6) U ← Gram-Schmidt orthonormalization of (B†)_{ij,k}, with (i, j) grouped via the ordering described in the main text
Output: disentangling unitary U_{ij,k}

The algorithm is summarized in Algorithm 1. Below, we explain the algorithm in detail along with the underlying intuition.
To gain intuition, we will consider a simple example where the input tensor A_{k,ab} is just a tensor product of three matrices:

A_{k,ab} = M^(1)_{k_1 a_1} M^(2)_{k_2 b_1} M^(3)_{a_2 b_2},

where we are grouping the indices k = (k_1, k_2), and similarly a = (a_1, a_2) and b = (b_1, b_2). Then it is clear that an ideal unitary U_{ij,k} should decompose A_{k,ab} as follows:

∑_k U_{ij,k} A_{k,ab} = M̃^(1)_{i a_1} M̃^(2)_{j b_1} M^(3)_{a_2 b_2},

where M̃^(1) and M̃^(2) equal M^(1) and M^(2) up to unitary transformations, since this minimizes the entanglement across the cut shown in Fig. 1.
Note that U_{ij,k} does not need detailed knowledge of M^(1), M^(2), or M^(3); rather, U_{ij,k} only needs to be a basis that matches the index i with M^(1) and j with M^(2). The indices a and b give us a handle on this basis, since M^(1) and M^(2) couple k to a and b. The complication is M^(3), which also depends on a and b. Therefore, the intuition behind our algorithm will be to project out M^(3) so that U_{ij,k} can be computed.
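Before walking through the steps, it may help to see the simple example numerically. The following sketch is our own illustration (with arbitrarily chosen 2 × 2 factors; it is not code from Ref. [25]): it constructs the ideally disentangled form above and checks that the residual entanglement across the cut of Fig. 1 is exactly the entropy of M^(3)'s singular values alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_from_singular_values(s):
    """Von Neumann entropy of a (possibly unnormalized) singular-value spectrum."""
    p = s**2 / np.sum(s**2)
    p = p[p > 1e-15]
    return -np.sum(p * np.log(p))

# Random 2x2 factors of the tensor-product example A = M1 (x) M2 (x) M3.
M1, M2, M3 = [rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)) for _ in range(3)]

# Ideally disentangled form: (U.A)_{ij,ab} = M1[i,a1] M2[j,b1] M3[a2,b2].
# Grouping rows as (i,a1,a2) and columns as (j,b1,b2) realizes the cut of Fig. 1.
UA = np.einsum('ia,jb,cd->iacjbd', M1, M2, M3).reshape(8, 8)

s_cut = np.linalg.svd(UA, compute_uv=False)
s_M3 = np.linalg.svd(M3, compute_uv=False)
```

Since M^(1) and M^(2) sit entirely on one side of the cut each, only M^(3) contributes entanglement: the two entropies agree to machine precision.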
Step (1) of the algorithm begins by choosing a random vector r_k of length χ_1 χ_2. (2) Then compute α^(3)_a and α^(4)_b: the dominant left and right singular vectors of the χ_3 × χ_4 matrix (r · A)_{ab} = ∑_k r_k A_{k,ab}. For the simple example, r · A will be a tensor product of two matrices: M^(3) and the matrix obtained by contracting r with M^(1) and M^(2). This implies that α^(3)_a and α^(4)_b will each be a tensor product of two vectors, α^(3)_a = β^(3)_{a_1} γ^(3)_{a_2} and α^(4)_b = β^(4)_{b_1} γ^(4)_{b_2}, where the β and γ vectors are dominant singular vectors of the two factor matrices, respectively.

Steps (3) and (4) allow us to isolate M^(1) and M^(2): contract α^(4)* (respectively α^(3)*) with A and perform a thin SVD, where only the largest χ_1 and χ_2 singular values are kept in the first and second cases, respectively. Thus, V^(3) and V^(4) are χ_3 × χ_1 and χ_4 × χ_2 isometries. For the simple example, the matrices V^(3) and V^(4) only depend on the thin SVDs of M^(1) and M^(2) [Eq. (7)], up to an unimportant tensor product with a vector γ^(3) or γ^(4). This allows us to project out M^(3) in the following step, as seen in the bottom right of Eq. (8).
(5) Compute B_{k,ij} by contracting A_{k,ab} with V^(3)* and V^(4)* on the a and b indices [Eq. (8)]. For the simple example, due to the direct-product structure of V^(3) and V^(4) shown in Eq. (7), B_{k,ij} takes the form shown in the bottom right of Eq. (8). Importantly, M^(3) only affects B_{k,ij} by a multiplicative constant, so the indices i and j give us a good handle on how to split the index k. If B_{k,ij} were unitary, which would be the case if M^(1) and M^(2) were unitary, then we could take U = B† to minimize the entanglement across the cut as in Eq. (2). Since B_{k,ij} is generally not unitary, we instead (6) take U to be a Gram-Schmidt orthonormalization of (B†)_{ij,k} (see footnote 7). This produces the desired result [Eq. (2)] for the simple example (up to trivial multiplication of unitary matrices on the i and j indices of U_{ij,k}).
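The six steps can be sketched in a few lines of NumPy. This is our own minimal reading of the algorithm, not the reference implementation of Ref. [25]; in particular, the conjugation conventions and the row-major (i, j) Gram-Schmidt ordering are our assumptions.

```python
import numpy as np

def fast_disentangle(A, chi1, chi2):
    """Approximate disentangling unitary U[(ij),k] for a tensor A[k,a,b].

    A has shape (chi1*chi2, chi3, chi4); assumes chi1 <= chi3 and chi2 <= chi4.
    Index/conjugation conventions here are our assumptions, not Ref. [25]'s.
    """
    dk, chi3, chi4 = A.shape
    assert dk == chi1 * chi2 and chi1 <= chi3 and chi2 <= chi4
    # (1) random vector r_k of length chi1*chi2
    r = np.random.randn(dk) + 1j * np.random.randn(dk)
    # (2) dominant left/right singular vectors of (r.A)_{ab}
    u, _, vh = np.linalg.svd(np.tensordot(r, A, axes=(0, 0)))
    alpha3, alpha4 = u[:, 0], vh[0, :].conj()
    # (3) thin SVD of A contracted with alpha4*; keep chi1 right-singular vectors
    _, _, vh3 = np.linalg.svd(np.tensordot(A, alpha4.conj(), axes=(2, 0)),
                              full_matrices=False)
    V3 = vh3[:chi1, :].conj().T            # chi3 x chi1 isometry
    # (4) likewise with alpha3*; keep chi2 -> chi4 x chi2 isometry
    _, _, vh4 = np.linalg.svd(np.tensordot(A, alpha3.conj(), axes=(1, 0)),
                              full_matrices=False)
    V4 = vh4[:chi2, :].conj().T
    # (5) B_{k,(ij)} = sum_{ab} A_{k,ab} V3*_{a,i} V4*_{b,j}
    B = np.einsum('kab,ai,bj->kij', A, V3.conj(), V4.conj()).reshape(dk, dk)
    # (6) Gram-Schmidt orthonormalize the rows of B^dagger, via a QR
    #     decomposition of B; fix column phases so R has positive diagonal,
    #     which makes Q the classical Gram-Schmidt result
    Q, R = np.linalg.qr(B)
    d = np.diagonal(R)
    d = np.where(np.abs(d) > 1e-14, d, 1.0)
    Q = Q * (d / np.abs(d))
    return Q.conj().T
```

For the tensor-product example of the previous section, this sketch recovers the ideal decomposition up to unitaries acting on i and j, so the residual entanglement across the cut reduces to the entropy of M^(3)'s singular values.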
If χ_1 = χ_3 and χ_2 = χ_4, then V^(3) and V^(4) in Eq. (8) are just unitary matrices. Therefore V^(3) and V^(4) only change the basis of the vectors that are Gram-Schmidt orthogonalized in step 6. One could then consider skipping steps 1-5 and instead inputting B_{k,ij} = A_{k,ij} to step 6. The ansatz in Eq. (1) would still be optimally disentangled in this case. However, since the output of Gram-Schmidt depends on the initial basis, the resulting disentangling unitary will be different in general. Indeed, the resulting disentangling unitary will typically be significantly worse for general input tensors A_{k,ij} (see footnote 8).

The algorithm is not deterministic since r_k is random, which helps guarantee the tensor product structure in Eq. (3) by splitting possibly degenerate singular values. Thus, it could be useful to run the algorithm multiple times and select the best result. Also note that the (statistical) result of the algorithm is not affected if A is multiplied by a unitary matrix on any of its three indices. As such, it is not useful to rerun the algorithm on U · A (rather than just A) in an attempt to improve the result.

Footnote 7: In some cases, it could be useful to try both (i, j) orderings and return the best resulting unitary.
Footnote 8: For the χ_1 = χ_2 = χ_3 = χ_4 = 2 tensors that we consider in Tab. 1, skipping steps 1-5 results in an entanglement S_fast that is about 20 to 35% larger (on average).
Throughout this section, we assume χ_1 = χ_2 and χ_3 = χ_4. In Tab. 1, we show how well our algorithm minimizes the von Neumann entanglement entropy

S = -∑_i λ_i^2 ln λ_i^2,    (10)

where λ_i are the (normalized) singular values of U · A across the red line in Fig. 1 [i.e. the singular values of (U · A)_{ij,ab} when viewed as a (χ_1 χ_3) × (χ_1 χ_3) matrix with indices (ia) and (jb)]. We also investigate the truncation error that results from only keeping the first χ singular values:

ε_χ = ∑_{i>χ} λ_i^2.    (11)

In the first four rows, we investigate random χ_1^2 × χ_3 × χ_3 tensors of complex Gaussian random numbers. We then consider random tensors with fixed singular values λ_i = 1/i or λ_i = 2^{-i} (up to normalization), which are generated using

A_{k,ab} = ∑_i λ_i W_{k_1 a, i} V*_{k_2 b, i},    (12)

where W and V are random unitaries (e.g. ∑_{k_1 a} W_{k_1 a, i} W*_{k_1 a, j} = δ_{ij}). In the final three rows, we generate tensors as sums of tensor products,

A_{k,ab} = ∑_n v^(1,n)_k v^(2,n)_a v^(3,n)_b,    (13)

where the v^(m,n) are normalized random complex vectors. The latter types of tensors have more structure and are (in a sense) less dense than the previous types.

We find that our fast algorithm performs best for more structured tensors (lower rows in the table) and exhibits the greatest speed advantages for larger χ and more structured tensors. The fast algorithm typically results in an entanglement S_fast within 10 to 40% of the global minimum S_min (which we approximate by running a gradient descent algorithm on several different initial unitaries for each input tensor). In the fifth column of Tab. 1, we show how much longer (on average) it takes the gradient descent algorithm to optimize down to the entanglement S_fast reached by our fast algorithm; we find speedups ranging from 20 to 20,000 times as the bond dimension is increased from 2 to 16.
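The two benchmark quantities of Eqs. (10) and (11) are straightforward to compute from U · A. A minimal sketch (assuming U · A is stored as a χ_1 × χ_2 × χ_3 × χ_4 array with indices (i, j, a, b); the helper names are ours):

```python
import numpy as np

def cut_spectrum(UA):
    """Singular values of (U.A)_{ij,ab} across the (ia)|(jb) cut of Fig. 1."""
    chi1, chi2, chi3, chi4 = UA.shape
    M = UA.transpose(0, 2, 1, 3).reshape(chi1 * chi3, chi2 * chi4)
    return np.linalg.svd(M, compute_uv=False)

def entanglement_entropy(s):
    """Von Neumann entropy S of Eq. (10), using normalized singular values."""
    p = s**2 / np.sum(s**2)
    p = p[p > 1e-15]
    return -np.sum(p * np.log(p))

def truncation_error(s, chi):
    """Truncation error of Eq. (11): discarded weight beyond the largest chi values."""
    p = np.sort(s**2)[::-1] / np.sum(s**2)
    return np.sum(p[chi:])
```

For example, a maximally entangled tensor across the cut has a flat spectrum, giving S = ln(χ_1 χ_3) and a truncation error of (χ_1 χ_3 - χ)/(χ_1 χ_3).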
In the final two columns, we find that our fast algorithm achieves a truncation error to bond dimension χ_1 that is within a factor of two of what is obtained by minimizing S. In Fig. 2, we study the truncation error in more detail. We find that if we truncate to a large enough bond dimension, our fast algorithm achieves a smaller truncation error than what is obtained by minimizing S. Both algorithms greatly reduce the truncation error relative to the original random tensor A (which we reinterpret as a tensor with four indices instead of three).

Wavefunction Disentangling
We further validate our algorithm by studying how well it can disentangle a random wavefunction of 10 qubits (see Ref. [33] for an MPS approach to wavefunction disentangling). That is, starting from a wavefunction of 2^10 complex Gaussian-distributed amplitudes, we repeatedly apply 2-qubit unitaries, as shown in the inset of Fig. 3, to reduce the amount of entanglement across any cut of the wavefunction. Thus, we apply the disentangling of Fig. 1 to qubit pairs (i, i+1) for i = 1, 3, ..., n-1 and then i = 2, 4, ..., n-2 to calculate the two layers of unitaries shown in the inset of Fig. 3, for which n = 10. We show how much entanglement is left after a given number of layers of unitaries. We compare data from our fast disentangling algorithm to gradient descent of the entanglement entropy S [Eq. (10)].

Table 1: [... Eq. (13)], where U · A has dimensions χ_1 × χ_1 × χ_3 × χ_3, we list: the rough CPU time of our fast algorithm; roughly how much faster this is (on average) than a gradient descent algorithm of the entanglement entropy that halts at the entanglement S_fast reached by our fast algorithm; how close the resulting entanglement entropy S_fast of our fast algorithm is to the minimum S_min; the same for a random unitary (for comparison); the truncation error ε_{χ_1}^{fast} [Eq. (11)] to bond dimension χ_1 for our fast algorithm; and the same for the unitary that minimizes the entanglement to S_min (for comparison). The last four columns show means and the two-sided deviations to the 16th and 84th quantiles (e.g. a normal distribution would be shown as μ^{+σ}_{-σ}).
When the circuit depth is small, our fast algorithm disentangles at a slightly slower rate per circuit layer, but is much faster in terms of CPU time. When the circuit depth is larger and the wavefunction has little entanglement left, our algorithm performs better than minimizing the entanglement entropy. Gradient descent of S gets stuck at larger depths due to narrow valleys in the cost function S, which result in very small (< 10^-8) step sizes that cause our gradient descent algorithm to halt.
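The residual entanglement plotted in Fig. 3 (the maximum entanglement over all left/right cuts) can be computed directly for a small system. A sketch assuming the wavefunction is stored as a dense length-2^n vector (the function name is ours):

```python
import numpy as np

def max_cut_entropy(psi, n):
    """Maximum von Neumann entanglement entropy over all left/right cuts of an
    n-qubit wavefunction psi, stored as a dense vector of length 2**n."""
    psi = np.asarray(psi)
    best = 0.0
    for cut in range(1, n):
        # singular values of psi viewed as a (2**cut) x (2**(n-cut)) matrix
        s = np.linalg.svd(psi.reshape(2**cut, 2**(n - cut)), compute_uv=False)
        p = s**2 / np.sum(s**2)
        p = p[p > 1e-15]
        best = max(best, -np.sum(p * np.log(p)))
    return best
```

A product state gives zero for every cut, while a GHZ state gives ln 2 for every cut.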
We also compare against initializing the gradient descent of S with the result of our fast disentangling algorithm. This is shown in blue in Fig. 3; it achieves the best disentangling rate in both limits, while also speeding up the gradient descent algorithm by a factor of two.
After 500 layers consisting of 2250 2-qubit gates, our fast algorithm removed almost all of the entanglement. An arbitrary 2-qubit gate can be implemented using three CNOT gates along with 1-qubit gates [26-28]. Therefore, the fast algorithm's circuit of 2250 2-qubit gates can be implemented using only 6750 CNOT gates. For comparison, it is possible to exactly disentangle an n = 10 qubit state using a circuit of 9 × 2^{n+1} ≈ 18,000 nearest-neighbor 2-qubit CNOT gates along with many 1-qubit gates [29,30].

Figure 3: The residual entanglement after applying layers of unitary operators to a random 10-qubit wavefunction, for various algorithms. The residual entanglement is the maximum entanglement across any left/right cut of the wavefunction. We apply each method to the same three random wavefunctions, resulting in three nearly overlapping lines for each method. We also show the amount of CPU time used by each method to obtain the data shown. (inset) Two layers (i.e. depth = 2) of unitaries acting on the wavefunction.

Conclusion
We have introduced, provided intuition for, and benchmarked a fast algorithm to approximately optimize disentangling unitary tensors. Example Python, Julia, and Mathematica code can be found at Ref. [25].
We expect our algorithm to be useful for tensor network methods that require disentangling unitary tensors. Due to its speed, our fast method can allow for simulating significantly larger bond dimensions than previously possible. The advantages of larger bond dimensions could outweigh the disadvantage of the non-optimal disentangling unitaries that our algorithm returns. Nevertheless, if more optimal unitaries are required, our fast algorithm can still be useful as a way to initialize an iterative algorithm.
For future work, it would be useful to consider an ansatz of tensors that are a tensor product of our ansatz Eq. (1) with a GHZ state (i.e. A_{k,ab} = 1 if k = a = b, and 0 otherwise). Such tensors are the generic form of stabilizer states with three indices (up to unitary transformations on the three indices) [31,32]. Although these tensors can be optimally (and relatively easily) disentangled using a Clifford group unitary, our fast algorithm performs very poorly on these tensors.
(3) Perform a (slightly modified) fast disentangling Algorithm 1 on A_{k,(aa′)b}, where the second index is the grouped index (a, a′). The fast disentangling algorithm is modified in step 6: the (i, j) indices should always be grouped using the ordering that would be chosen if χ_1 ≤ χ_2.