Improved Pseudolikelihood Regularization and Decimation methods on Non-linearly Interacting Systems with Continuous Variables

We propose and test improvements to state-of-the-art techniques of Bayeasian statistical inference based on pseudolikelihood maximization with $\ell_1$ regularization and with decimation. In particular, we present a method to determine the best value of the regularizer parameter starting from a hypothesis testing technique. Concerning the decimation, we also analyze the worst case scenario in which there is no sharp peak in the tilded-pseudolikelihood function, firstly defined as a criterion to stop the decimation. Techniques are applied to noisy systems with non-linear dynamics, mapped onto multi-variable interacting Hamiltonian effective models for waves and phasors. Results are analyzed varying the number of available samples and the externally tunable temperature-like parameter mimicing real data noise. Eventually the behavior of inference procedures described are tested against a wrong hypothesis: non-linearly generated data are analyzed with a pairwise interacting hypothesis. Our analysis shows that, looking at the behavior of the inverse graphical problem as data size increases, the methods exposed allow to rule out a wrong hypothesis.


Introduction
Concepts and tools from statistical mechanics turn out to be valuable resources to analyze the behavior of systems of very diverse nature, ranging from neuroscience [1,2,3] to systems biology [4,5,6], economics [7,8,9], finance [10,11], sociology [12] and language evolution [13], just to name a few.In this work, we start from the use of statistical mechanics to characterize the behavior of complex optical systems [14,15], in which many modes propagate or develop due to external sources and nonlinearities might be fundamental in describing the system behavior.
The study of Hamiltonian multi-body interaction systems as models to describe the interactions among electromagnetic modes in multimode lasers has provided important understanding on several experimental studies [16,17,18,19].In these works, the interests was mainly in the study and description of the electromagnetic output assuming some network of interactions among the modes.In this paper we will focus on statistical inference, i.e., on the inverse problem: the reconstruction of the statistical model parameters from data acquired in real experiments or in numerical simulations of the output.
Even though our original motivation arises in the framework of nonlinear optics and laser physics [14,20,21,22,23,24], in this work we will concentrate on state-of-the-art techniques to solve inverse problems.Indeed, the techniques here presented can be applied to a large class of models with multi-body interactions.
The most studied inverse problem within the statistical physics community is the inverse Ising problem.This can be regarded as the E. coli of statistical physics: the thermodynamic properties are known and the different techniques can be tested on the thermodynamic phases.In the Ising model the interacting variables are discrete spins, σ = ±1, while the couplings describing their interactions can be generically thought of as random variables chosen from desired probability distributions.Many inference techniques have been tasted on the Inverse Ising model: from mean field methods [25] and their extensions [26,27] to various likelihood maximization techniques [28,29].The literature on the inverse Ising problem is very wide, see e. g., Ref. [30] for a recent review.On the other hand, few studies can be found on continuous spin models [15,31].Moreover, inference techniques on systems with strong nonlinear responses have not yet been exhaustively explored.Beside the mentioned nonlinear optics, there are many other research fields in which nonlinear inference techniques are relevant: constraint satisfaction problems [32,33], error correcting codes [34,35], non linear neural networks [36], peptide sequences in DNA [37], non linear electric networks [38], fish shoals [39,40], heterogeneous and frustrated glassy systems [41,42,43,44,45,46,47,48,49].
This study follows a previous paper [50], where we have presented the physical systems of interests as well as the inference techniques used, namely pseudolikelihood methods with

Test Model
In this section the models and the physical systems of interest are introduced.The interested reader can find more details in [14].This paper is organized in such a way to let the reader interested only in inference techniques applied to nonlinear systems to skip this section.

Complex Spherical Model: Relevance in Photonics
The class of Hamiltonians that we will consider has the form where, in the most general case, a j are complex numbers and J ijkl are the complex interaction couplings among them.The sum considered in Eq. ( 1) is subjected only to distinct indices.A complete derivation of the above model in the context of optics is presented in Ref. [21].
The Hamiltonian description comes about as the electromagnetic field can be expressed as where E k (r) are the time-independent solutions of the wave equation in the medium, the socalled normal modes.The a k (t)'s vary on a time scale much longer than ω −1 k [55].Each mode can then been seen as a phasor "spin", complemented with its own mode intensity A 2 k = |a k | 2 and phase φ k = arg(a k ) ∈ [0, 2π].As we can see from Eq. (1), we consider systems with 4-body interaction terms, representing the first nonlinear order term for optical systems with time reversal symmetry (i.e., with optical susceptibility χ (3) ).Eventually, because of the total power constraint and regulatory mechanisms such as, e.g., gain saturation responsible for the stationarity of the laser regime, the a k 's satisfy a global spherical constraint, i. e., N k=1 |a k | 2 = const × N , which assures a bounding of the energy (1).The inverse of the pumping rate plays the role of the effective temperature.The model introduced in Eq. (1) can, then, describe how the intensity is distributed among the modes in the stationary regime and how and if the phases of the modes would be synchronous.In this setup, the interaction couplings J jklm express the interaction among the modes due to the competition for the energy in the same region of the gain medium.
Starting with Ref. [56], Gordon and Fischer were the first to consider multimode lasers in a statistical-mechanical framework.They studied the statistical properties in homogeneous cavities taking into account non-linear effects like gain saturation and intensity dependent refractive index: the system shows a thermodynamic phase transition, i.e., the transition to the so-called multi-mode mode-locking ultrafast laser regime, in which the modes oscillate in a phase-locked behavior.Different thermodynamics phases are also observed in random lasers [20,21,57,58].

XY model, aka Quenched Amplitude model
Being interested in studying possible mode-locking regimes characterized by strong correlations among the phases of the modes, one can consider to investigate the situation of all the intensities |A j | as being quenched with respect to the phases, i.e., varying on much longer timescales in respect to the phases and considered as constants.Starting from Eq. (1), by a rescaling of the coupling coefficients A j A k A l A m J jklm → J jklm , we are left with the Hamiltonian of the XY model: being J R and J I the real and imaginary part of the network coupling.We have analyzed the inverse problem for both models: the complex spherical/phasor model, Eq. ( 1), the XY /rotor model and Eq. ( 3).Introducing the XY model gives also the possibility to analyze the dynamics on sparse graphs with a number of total quadruplets scaling only with N .Indeed, if the number of total quadruplets does not scale at least with N 2 , the dynamics of the spherical model displays power condensation: all the energy condensates in a quadruplet of phasors, the behavior is not homogeneous and ergodicity is not satisfied.

Frequency Matching Condition and mode-locking
Deriving Eq. ( 1) (see, for example, [14]), one can see that 4-modes can interact if and only if the following Frequency Matching Condition (FMC) is satisfied, i.e., where with ω k we indicate the frequency of mode k and with γ the linewidth.Our interest will rely on whether or not the inference techniques proposed are able to reconstruct this underlying structure.Moreover, knowing this constraint in advance, it might be possible to determine the frequency distribution of the modes.The 24 possible permutations of the 4 indexes in Eq. ( 4) can be divided into 3 non-equivalent subsets of 8 permutations each.So, the FMC can be satisfied by one, or more, of these independent classes of permutations.The term "narrow band" [56,58,59,60,61] indicates the case in which all modes oscillate in a relatively small frequency bandwidth, i.e., ω l ω 0 within the linewidth γ of mode l and the FMC does not play any-role in the system behavior.For comb-like distributed frequencies, we have ω l = ω 0 + lδ with γ δ.In this case, the systems is not fully connected and the connectivity of each mode depends on its frequency: the FMC plays a role in the construction of the interaction network.These graphs are termed Mode-Locked (ML) graphs.In this paper, we will consider both narrow band and ML graphs.

Inverse problem: data and inference techniques
The aim of supervised statistical learning is to predict the model parameters from data describing the dynamics of the system.Having modeled our optical system in terms of complex spherical or XY spins interacting on a given graph, we now want to analyze the inverse problem: reconstructing the network of interaction as well as the coupling strength.
Since at present we have no access to experimental data, we will use numerical experiments to generate data providing, as a first step, an easy interface with real world experiments.The SciPost Physics Submission data used have been generated through Monte Carlo numerical simulations of the systems at equilibrium.Both systems, Eq. ( 1) and Eq. ( 3) have been simulated.We have considered both the case of sparse graphs, in which the number of interacting quadruplets scales like the number of variables, N q ∝ N , and more dense graphs, in which N q ∝ N 3 .Notice that a complete dense graph would contain O(N 4 ) interacting quadruplets.Furthermore, we have considered strict frequency matching conditions, cf.Eq. ( 4), based on comb-like single mode resonance distributions (γ δω), as well as narrow-band conditions (γ > δω).Considering the physical model of interests, the couplings Js will depend on the spatial distribution of the modes and on the non-linear response of the system.Particularly for random lasers, in which the modes are randomly distributed in space and the nonlinear susceptibility is highly inhomogeneous, we expect a variation of the value of the coupling among the interacting quadruplets.Because of the partial knowledge of modes localization and the very poor knowledge of the nonlinear response so far in experiments, the random values for the Js can be taken from any physically reasonable arbitrary probability distribution.We will then take the couplings of a multibody interacting network, with number of variables N and number of couplings N q ∼ N z , as generated either through a bimodal distribution, i.e., with Ĵ = 1/N (z−1)/2 , or through a Gaussian distribution of mean square displacement σ ∼ Ĵ.Indeed, we can analyze the performance of the inference techniques for both discrete and continuously distributed couplings.The methods exposed also work for the simpler cases of uniform couplings like in standard mode-locking lasers [22,56,62,63].We inferred data within the equilibrium hypothesis, expressed by the Boltzmann-Gibbs distribution in the likelihood function.To reduce the Monte-Carlo steps required for thermalization we used the parallel tempering algorithm.For the "narrow band" approximation, where the frequencies of the modes do not play any role in the construction of the interaction graph, we generate instances of Erdös Rényi (ER) graphs: each quadruplet is added to the graph independently, with probability M/ N k , where M is the total number of quadruplets in the graph.In order to obtain a Mode-Locked (ML) graph, starting from a so generated ER graph, we remove those quadruplets that do not satisfy the Frequency Matching Condition (FMC), Eq. (4).As proved in Appendix A, in the thermodynamic limit, the number of removed quadruplets scales like N/2 while the node connectivity distribution tends to a Poissonian, as in ER-like graphs [64].On the other hand, finite size effects are clearly stronger in the ML graph with respect to the ER-like graph for the relatively small simulated sizes.

Pseudolikelihood method
A standard approach used in statistical inference is to predict the model parameters by maximizing the likelihood function.This technique, however, requires the evaluation of the partition function that, in the most general case, concerns a number of computations scaling exponentially with the system size.Different approaches have then been used: Boltzmann machine learning [65,66], mean field methods [67,68] with various extensions [26,27,69,70].A local alternative to the likelihood function was introduced and referred to as Pseudo Likelihood Function (PLF) [28].It was first developed for spatial models in Ref. [71] and later extended as an alternative to maximum likelihood function for networks.The most attractive part of the PLF is its computational tractability in comparison to the likelihood function.It keeps a good balance between the computational complexity and the efficiency of the estimation.The PLF is maximized with respect to its parameters to find the corresponding estimators.This method is known as Pseudo Likelihood Maximization (PLM).Such a logistic regression based method proves to work very efficiently on sparse networks.As well as the likelihood maximum estimator, the PLM estimator is consistent and asymptotically normal , i.e., as the number of training samples increases (i) the inferred values tend to the true values and (ii) the distribution of the inferred parameters tends to a Gaussian one.We are now deriving the PLF for non-linearly interacting wave systems.
Using Eq. ( 1), we have that the probability of observing a configuration a given a set of couplings J is: Eq. ( 1) can be rewritten as: Where a \j is the set of all amplitudes but a j and we have defined the complex-valued local effective fields as We want to determine P i (a i |J , a \i ) that, using Eq. ( 7), can be written as All the terms in j a j H j [a \j ] that do not depend on a i will simplify with the denominator.We write explicitly: The sum over j of the terms that depend on a i gives another term a i H i [a \i ] and where The factor 4 will be absorbed in the definition of inverse temperature.The integral sum over a i can be successfully carried out by evoking the global spherical constraint j |a j | 2 = N , with constant .Given all the a \i , indeed, the value of |a i | is fixed by and a i simply reduces to an integral on the angular phase variable φ i ∈ [0 : 2π[.Assuming that we are given M independent configurations a µ , µ ∈ 1, . . ., M , extracted from the Gibbs measure, the log-pseudolikelihood function eventually reads Next step is to minimize −L i with respect to the Hamiltonian parameters that we want to infer: {J}.The stationary solution, in general, can only be computed by using a local gradientbased minimization [72].To do so we need to compute explicitly the partial derivatives of −L i with respect to each coupling constant: where we denoted

Pseudolikelihood functional with rotor variables
Rewriting the complex amplitude in polar coordinates a i = A i e ıφ i we have the following expression for the marginal, Eq. ( 12), where and I 0 (x) is the modified Bessel function of the first kind: When the couplings are considered real-valued, the polar expressions of the local effective fields in Eq. ( 7) can be rewritten by substituting Eq. ( 9) with F R jkl = cos φ j cos φ k cos φ l + cos φ j sin φ l sin φ k + cos φ l sin φ j sin φ k + cos φ k sin φ j sin φ l 3 (20) In this case, the log-pseudolikelihood functional L i , Eq. ( 14), and its gradient, Eq. ( 15), simplify to We note that in the mean-field-like inference, one crucial minimal criterion for the inverse problem to be tractable is that the number of data observations M has to be larger or equal to N , in order for the correlation matrix to be invertible.In the present method this lower bound is not strictly requested, though a unique solution to PLM is guaranteed only when M is larger than the number of coupling constants to be inferred.We will now compare two pseudo-likelihood techniques: PLM with 1 -regularization [28,73] and the PLM with decimation [29].

Improved Pseudo Likelihood Maximization with l 1 regularization: hypothesis testing
A regularizer is usually required to prevent overfitting in the minimization procedure.It is done so by putting a kind of prior to enforce couplings to take small values [28].One particular kind of regularizer used is the l p -norm regularizer.It is defined for a vector x = (x 1 , . . ., x N ) as, Any regularizer should be convex so that the convexity of the inverse problem remains intact.
In this approach, we add an 1 norm: The 1 has proved to be special with respect to p > 1 norms, because it performs well on sparse problems, where only a few parameters are actually non-zero.This is the case, e. g., of sparse graphs, in which the number of couplings per variable, the "connectivity" c = N q /N , does not grow with N1 .The reconstruction of the topology is further enhanced by the so called δ-thresholding [73], i.e., couplings that are inferred less that δ are set to zero.This techniques comes with its own shortcoming in the decision of the value of δ.Indeed, there can be cases where the zero and non-zero couplings are not clearly separated [29], see, e.g., the left panel of Fig. 12 in Sec.4.4 for low number of samples M .With the knowledge of the probability distribution of the estimators, this problem can be overcome by using an accurate hypothesis testing scheme.It is known that as M → ∞, the probability distribution of the PLM estimator converges to a Gaussian distribution centered around the true value of the coupling with variance estimated by the diagonal elements of the inverse of the Fisher information matrix [75].The elements I i ab of the information matrix are defined through: where a, b indicate two possible quadruplets to which node i might belong to, i.e., I i ab ≡ I i jkl,j k l .Then, we can determine, for every estimated value, if it is compatible with a Gaussian centered in zero, i.e., if the hypothesis for the true coupling to be zero must be accepted or rejected.Note that, in order for the procedure to be consistent, the eigenvalues of the Fisher Information Matrix need to be bounded from below.This condition together with the requirement that the entries of I i ab related to nonneighbors of i cannot exercise an overly strong effect on the subset related to the neighbors of i assures that the PLM with l 1 regularization has a unique solution and correctly reconstruct the neighbors of i if enough number of samples are provided (M larger than the number of parameters to be inferred) [28].These requirements are then checked2 in our data.
The hypothesis testing is subsequently developed as follows.At the initial step, after finding the maximum of the PLF, the inverse of the Fisher information matrix is evaluated, Eq. ( 26).In the case of rotors (but proceeding analogously for phasors) from Eq. ( 22), we have: where B (x) is defined as: In Eq. ( 27), to simplify the notation, we have included the factor β through a rescaling of the Js.The diagonal terms of the inverse Fisher matrix, σa , are then computed as the estimators for the variances of the distributions P ( Ĵa ): As a further step, every coupling is hypothesized to be zero and it is verified whether every estimated value is compatible with the P (J a ) = N (0, σa ) distribution: we construct a confidence interval C n containing the estimated value Ĵa within a 97.5% probability; if the inferred Ĵa is contained in C n the zero hypothesis cannot be ruled out and that coupling is considered to be zero and taken away from the inferred network.We conclude this section noting that, by maximizing each L i separately, one has 4 different estimated values for each quadruplet coupling.For the final estimated value, the mean is usually evaluated but we remark that the information on the symmetry of couplings of the system is not used in the inference procedure.We will now see how, in the Pseudo Likelihood Maximization with Decimation, this problem is overcome, further reducing the number of unknown couplings by a factor four.

Pseudo Likelihood Maximization with Decimation
In the PLM with decimation [29], instead of maximizing N different pseudo-likelihood functions L i , an average PLF L over all variables is maximized [29]: One of the advantages over PLM-l 1 is that we are not perturbing the function to be maximized by adding a regularizing term and there is no choice of the λ parameter to be carried out.To reconstruct the set of non-zero couplings, the smallest estimated couplings are recursively put to zero and the maximization procedures is repeated until the best inferred model is achieved.
Let us indicate with J 1 /J 0 the set of non-zero/zero couplings.The procedure goes at follows.At the zeroth step of decimation, J * 0 , the inferred set of null couplings, is empty.At each step we maximize the L function obtaining the set J * of inferred couplings.We sort them by absolute value and we move the least ρ couplings from J * 1 to J * 0 .That is, we decimate ρ couplings from the system.We want to stop the decimation procedure when J 0 = J * 0 , i. e., all the couplings that are zero in the original system are put to zero in the inferred system.At each step, the PLF is between its value PLF max , evaluated when all possible couplings are considered, i.e., a fully connected graph with J * 0 = ∅, and PLF min , i.e., an empty graph with J * 1 = ∅: When we decimate couplings that are actually in J 0 , the value of PLF does not change from PLF max since we are decimating irrelevant couplings.As the set J * 0 approaches the real J 0 , the chance of eliminating an existing coupling increases and the PLF starts decreasing.To determine more precisely the fraction x of remaining couplings where the PLF starts decreasing, a tilted PLF is defined: with x = non-decimated couplings total number of couplings (32) At zeroth, when the graph is fully connected and x = 1, tPLF is zero.For an empty graph, x = 0 and PLF=PLF min , the tPLF= 0 once again.As x is decreasing from 1 to 0, we will observe first a linear increase up to a maximum and then a decrease [50].The best representation of the real network occurs at the value of x such that tPLF is maximum.The decimation stops there.

1 -regularization PLM and a priori λ estimation: no-match parameter
For the PLM with 1 -regularization the value of the λ regularizer is usually chosen arbitrarily and checked a posteriori: assuming that one knows the solution of the inverse problem for one set of data, the best λ is the one yielding minimum reconstruction error for the inferred couplings and -possibly simultaneously -the best network reconstruction of the inferred system.
In this section, we develop a mechanism through which we can choose the best λ regularizer a priori, without any knowledge of the couplings.Cross-Validation (CV) techniques are also often applied to determine the best value of λ on supervised learning algorithms (see, e.g., Ref. [76]) and they do not require the knowledge of any solution of the inverse problem.A CV method might be developed on the following scheme: (i.) the observed configurations are divided in two sets, a training and a validating set; (ii.) the training set is used to fit the model, i.e., to determine the interaction couplings as a function of the trial value for the regularizer; (iii.) a Monte Carlo dynamics of the model with these inferred couplings is performed, the equilibrium configurations are acquired and the four point correlation functions, C mc ijkl , are computed; (iv.) these correlation functions are, then, compared to those obtained from the configurations of the validating set, C val ijkl ; (v.) the optimal λ is, finally, taken as the value that minimizes the distance among C mc ijkl and C val ijkl .CV techniques are quite computational demanding and the number of samples used to fit the model and infer the interaction couplings is further reduced because the method also requires a validating set of data.We will show the study and comparison of correlation functions in the original and the inferred systems in Sec.4.3.3.
In this section, we venture into the system to find the optimal value of λ from three different perspectives.We start considering a system with N = 16 spins on ER graph with number of quadruplets, N q = 32.The analysis is shown at T = 2.2, next to the critical temperature T c 2.3.
1.For the first perspective we evaluate the True Positive Rate (TPR), that is the fraction of true bonds appearing also in the inferred set of bonds, i.e., J ∈ J 1 ∩ J * 1 , and the True Negative Rate (TNR), that is the fraction of missing bonds absent also in the inferred set of bonds, i.e., J ∈ J 0 ∩ J * 0 .We, then, look at the minimum λ for which the ratio TNR/TPR is equal to 1, i.e., the network is perfectly reconstructed.It is important to notice that the smaller the λ the less perturbed is the PLF.The entire range of the ratio TNR/TPR vs. λ is shown in Fig. 5.
2. The second perspective is the one which would give the minimum reconstruction error.
To determine how far the inferred values J * q of the distinct quadruplets q ≡ {i, j, k, l} are from the true values J q , we evaluate the reconstruction error: In fig.7 we show the reconstruction error obtained for the λ values show in Fig. 5.
3. For the third perspective, we introduce a new parameter called no-match parameter.
Consider the inferred value, J * q , for the quadruplet q ≡ {i, j, k, l}.By maximizing each L i separately, we have four different inferred values for J * q .The no-match parameter counts the quadruplets for which the result of the hypothesis testing was not the same for the four J * q s.Running the inference scheme for different values of λ gives us different values of the no-match parameter.In Fig. 1 we plot the values of the no-match parameter as a function of λ.We see that as λ increases, the no-match decreases.There is a value of λ(M ) beyond which the no-match parameter remains zero.We consider this as the optimal value of λ.
We stress that, in comparison with the requirements for the reconstruction error to be minimal or the TNR and TPR to be 1, to find λ(M ), for which the no-match parameter is first zero, we do not need any information about the real underlying network.We compare in Fig. 2 the performance of the newly introduced procedure 3 against the other two as a function of sample sizes M .

1 -regularization PLM and estimators variance
Next, we move to the analysis of the variance of the inferred coupling distributions, evaluated as explained in Sec.3.2 by computing the diagonal elements of the inverse of the Fisher information matrix I i jkl,j k l , cf.Eq. ( 26).We consider the same system as earlier, N = 16 XY spins on ER graph with N q = 32.Fig. 3 shows the variance for each coupling estimator.Each 4-index set is labeled by an integer index of quadruplet a ≡ {i, j, k, l}, with a = 1, . . ., N (N − 1)(N − 2)(N − 3)/24.The quadruplet indices are arranged according to the ascending order of the coupling values: the left most are, thus, the σ 2 a 's associated with non-zero couplings with negative value (12 of them in the considered system) and the right most are the σ 2 a 's associated with the non-zero quadruplets with positive value (20 of them).The rest in the middle, are the σ a 's associated with zero couplings (1788 of them).No significative dependence on the index of quadruplet is observed.As expected from the consistence property of the PLM estimator, the σ values decreases as the number of samples increases.In Fig. 4, we show that average value of σs as a function of temperature T .On top of the net decrease with increasing M , we observe that, when T ∼ T c , the σ 2 exhibit a sharp increase.

1 -regularization and decimation PLM: a comparison
In this section, we compare the regularized and the decimated inference schemes using a variety of tests.For illustration, we choose a system constituted of N = 16 nodes.

True Positive and True Negative Rates
For the model with bimodal distributed couplings, in Fig. 5, the ratio TNR/TPR is shown, both as a function of the λ regularizer for 1 and as a function of the number of non-decimated couplings x for decimation.The top figure shows the behavior for M = 1024 at T = 1.9 as a function of λ.Together with 1 we also use δ-thresholding as previously reported.We used δ = 0.1, 0.01, 0.0001.We see that the best result for the multi-decades range of λ examined is given by the 1 scheme with the hypothesis testing technique.For certain high values of λ, we see that the ratio becomes even larger than 1: with a strong regularization too many couplings go to zero.
The bottom plot shows the results obtained with the PLM with decimation.x = 1 corresponds to the first step of decimation, TPR= 1 and TNR= 0. As we decimate couplings the TNR/TPR increases.At x ∼ 0.02, TPR=TNR= 1.This would be the ideal point where to stop the decimation.In Fig. 6, we analyze the difference among x * , the maximum point of the tPLF, Eq. (31), that is actually determined without any knowledge of the graph, and this ideal x.We can clearly see that the best results are obtained working around the critical temperature, here T c ∼ 1. 34.
We can see that as T depart from T c and M is not large enough, x * departs from x.For the worst case, M = 512, we always have TNR<1.In Sec.4.4.2,we will explain how we could reach better performances.

Reconstruction error
Fig. 7 shows the reconstruction error at T = 2.2 for increasing samples size.The error is plotted as a function of λ for the 1 -regularization scheme and as a function of x for the decimation scheme.In all the plots, we see that there is a dip in the error curve: there is  labeled by "L1" indicate that the zero couplings are selected through the 1 -regularization with the hypothesis testing scheme.In the decimation scheme the minimum is given at the ideal x = 32/1820 ∼ 0.18 where also TPR = TNR = 1.In this case, this value coincide with the maximum of the tPLF.
In Fig. 8 we plot the error as a function of temperature, T , and sample size, M .On the left, we show the plot for M = 4096 for a wide range of temperatures ranging from T = 0.5 < T c to T = 6.5 > T c .For the system under consideration, we have N = 16, N q = 2N , bimodal disordered couplings and T c 2.3.On the right, we plot the error as a function of M ranging from M = 512 to M = 8192, for T = 4.3.We see also for this system the already established trends: err J increases rapidly at low temperatures and decimation provides consistently less error than the 1 scheme.Furthermore, the error scales as 1/ √ M for a given temperature.

Correlations
To get a better insight into the physical system that we are dealing with, we investigate the 4-point correlations, defined as The scatter plot in Fig. 9 compares correlations obtained numerically simulating the dynamics of the original system and correlations obtained in a system whose coupling values are those inferred by pseudolikelihood maximization.We present cases for three different temperatures: T = 0.8, 2.57 and 7.03, and, for three different sample size, M = 512, 2048 and 4096.A green color reference line with slope 1 is drawn to compare with the optimal condition.For the high temperature case all correlations are zero, as expected in the paramagnetic phase.For T ∼ 2.6, closer to T c 2.3, the correlations depart from zero and for both the 1 -regularization and decimation schemes are spread along the reference line.
For even lower temperature, T = 0.8, we see that the correlations are separated into two groups: those distinctly different from zero and those close to zero.The decimation scheme yields correlation values nearby the reference line for small M and with increasing M data eventually collapse on the reference line.On the contrary, with points turn out to be distributed below and above the reference line but not along it.As M increases the distance from the reference line tends to zero but much slower than with the decimation method.

Decimation results
We will now focus on the decimation procedure and display further results obtained through it.

Coupling values histogram
In Fig. 10, we plot the histogram of the inferred couplings for a system with N = 32 XY spins, N q = 192 quadruplets on an ER graph.The couplings are generated randomly from a bimodal distribution.Three different sample sizes, M = 8192, 16384 and 65536, at T = 3.3 are considered.The first row shows the histograms as obtained at the zeroth step of decimation.For low number of samples, the distributions of the non-zero inferred couplings are not centered on the true values.As we increase M , this distance lowers.In Fig. 11 show the histograms as obtained when the decimation procedure stops.In these cases, the network of interactions is correctly inferred.Moreover, the non-zero couplings are distributed around the true values for any M .As expected, by increasing M , the variance of the non-zero distributions decreases.
For the case of the complex spherical model, Eq. ( 1), we show an example of results in Fig. 12.Here the system is composed of N = 32 phasors with N q = 2360, couplings follow a bimodal distribution.Results are at T = 7.1 > T c .The figure on the left reports the couplings obtained at zeroth step of decimation.In this case for M < 65536 the distributions of zero and non-zero couplings overlap.When the decimation stops, figure on the right, we can see that the network of interactions is clearly reconstructed for M > 8192.For M = 8192 the algorithm is not able to correctly identify the zero couplings, i. e., TNR < 1: the distribution of the non-zero couplings cannot be clearly identified since a strong overlap with the distribution of zero couplings remains.

Estimating decimation halt criterion
In PLM with decimation, the criterion to stop the decimation is to reach the maximum point of the tPLF.It has been reported earlier [15,29,50], however, that when M is not enough or T is too far from T c , the peak of tPLF is not always sharp enough [50] to be unambiguously identified numerically.We have established an alternative halt criterion for these complicated cases.We consider the relative difference ∆ i in the tPLF as one passes from the network at decimation step i to the one at step i + 1:   In Fig. 13 we plot ∆ i as a function of x.During the decimation procedure, as we proceed further in the iteration, we find that there is a discontinuity in the ∆ i function.This discontinuity appears at the step at which a true coupling is decimated from the system.This point is chosen as new stopping point.The corresponding fraction of non decimated couplings is termed x ∆ .This new stopping criterion comes with its own limitation: if there are not enough data samples available and we are in a very high temperature range, we cannot find   In Fig. 14 we analyze the performance of this new criterion in respect to the maximum  point of tP LF : x M indicates the maximum of tPLF and x m the minimum of the reconstruction error.α is defined as: In Fig. 15, we plot these results for a system of N = 32 spins: in this case we are far from T c and the new criterion outperforms the previous one also for small α.
In Fig. 16, we show how the position x M of the tPLF peak varies varying T and M .The discontinuity of ∆, on the other hand, though it smoothens and eventually disappears as T increases, always stays at the same x ∆ value when it is observabel.In the figure we consider two different temperatures, T = 2.84 and 5.95 both bigger than T c .
In the left hand side we show the results for the entire range of x ∈ [0, 1].In the right hand side there is a zoom in the x region of interest.Around T c and/or for high enough M , the ideal condition, x M = x m = x ∆ , is achieved but x M shifts to higher values for small M and large T .On the other hand, x ∆ = x m until we do not find any discontinuity in the ∆ function.

Pairwise model inference of multi-body systems
What happens when we carry out statistical inference by means of pseudolikelihood maximization if our hypothesis is wrong, i. e., if we maximize with respect to parameters of a wrong model?Are the techniques proposed so far able to identify a wrong theoretical hy- pothesis?To answer these questions we will analyze some of the previous data, generated by a non-linear -multi-body interacting -system under the hypothesis of pairwise interactions.Hence we will consider an Hamiltonian of the form: On the other hand, we take data from Monte-Carlo simulations of the 4 − XY model with N = 16 spins on a ER graphs.For this system there are initially quadruplets in total, among which only N 4 = N c/2 = 32 are actually non-zero.The non-zero coupling values are distributed according to a bimodal distribution.
A few words on how we are going to analyze the results of the inference procedure in this case.Indeed, some of the techniques so far exposed rely on the comparison between the inferred network and the original network (TPR, TNR, err J , . . .).Some others do not imply any knowledge of the original network (no-match parameter, tPLF(x) and its maximum in decimation, ∆ function).To use the techniques that rely on the knowledge of the real graph, we convert the system of quadruplets to a system of pairs by converting each quadruplet to 6 pairs.The so generated 6 pairwise bonds will take the same value of the related coupling J ijkl .Notice that the same pairwise bond, J ij , can pertain to different quadruplets and might thus display different values if the values of the quadruplet couplings are different.As a rule we allow up to 2 pair couplings, out of the 6 related to a single quadruplet, to take a different value.This is enough to build a related pairwise interacting network.

Data analysis
The results are shown for both PLM with 1 -regularization and with decimation.In Fig. 17 we show the reconstruction error as a function of T for various M for the regularization case.For small M we find a similar pattern as in the previous study and a minimum for T ∼ T c is clearly identified.On the other hand, for high M , err J remains constant above a given T .
In Fig. 18, we sort all couplings in descending order.The left plots refer to a system with N = 16 XY spins on an ER graph with N 4 = 32.In the right, N = 16 XY spins interact through N q = 47 quadruplets on a ML graph.In both cases couplings are extracted from a bimodal distribution.The inferred couplings are compared with those of the 2-XY graphs created as described in the previous section (black continuous line in the picture).
Beginning with the top left, we see that at low temperatures the inferred couplings are compatible with a bimodal distribution.Moving to the next panel below, for a higher temperature we see that for larger M the inferred couplings tend to shift more towards zero.This effect is even enhanced in the lowest panel.This trend is observed also in the right column, for the case of ML graph and, once again, it is more evident as the temperature is increased to respect to the critical temperature.
In Fig. 19, we report the results for three values of temperature obtained with PLM with decimation.The tPLF is plotted as a function of the number of non-decimated couplings.We see that at low temperature tPLF curves are overlapping and show a peak at about 100 decimated couplings.Increasing the temperature, the maximum point shifts to lower values.In this case we expect a peak around 20. Increasing the temperature even further, we can see that a clear peak is not even visible and, for high values of M , the tPLF does not show an increment remaining around zero: as M increases the tP LF shows that the degrees of freedom used to parametrize the probability of the system configurations are actually irrelevant.In Fig. 20, we show the reconstruction error.In Fig. 21, we show the comparison between the inferred and the original couplings sorted in ascending order.We find a similar result as with PLM with 1 -regularization.
From the above observations we have the evidence that if we use a wrong Hamiltonian model as a base for our learning analysis the parameters of the wrong model are simply inferred to be zero.In other words, the inference procedure is able to find that the wrong model is wrong, that it is not there.The method does not adjust things to adapt to the wrong model yielding some set of non-zero effective pairwise interactions.This discrimination is effective when the external tuning variable T is above the critical point and its quality increases with the number of data series M employed in the procedure.

Conclusions and future perspectives
In this work, we have presented a deep analysis of the algorithms developed and the results obtained in [50] to solve inverse problems for non-linear continuous spin models.We are motivated by optics: in studying the non-linear interactions among the electromagnetic modes, but the techniques here presented can be applied to a large class of models.In the specific case of non-linear photonic systems, knowledge about the interaction among the modes would give us a proxy to estimate the non-linear optical susceptibility χ (3)  [60].Further, knowing the frequency associated with each mode we could use statistical inference to probe the existence of phase-locking in random lasers or random media with amplified spontaneous emission (ASE).
In a previous work [15], the Pseudolikelihood algorithm proved to give much better performances with respect to mean field methods for continuous spin models.We have concentrated then on different possible approaches and implementations of the PLM estimator.The results showed that the algorithms are able to reconstruct the network of interactions, with higher accuracy close to the critical region, as well as the distributions of the couplings.
We also compared the results of the Pseudolikelihood maximization with decimation and Figure 19: tPLF obtained using data of the dynamics of a 4XY model with N = 16 spins and (left) N q = 32 quadruplets on a ER graph, (right) N q = 47 quadruplets on a ML graph.The PLM algorithm assumes a 2XY model: contrary to the results show in previous sections, as M increases the tPLF peak smooths down until the function becomes almost independent of the number on decimated couplings.with 1 -regularization.We propose a new criterion to determine the value of the regularizer λ for the 1 regularization without any a priori knowledge of the system.We analyze the performances of the methods on data generated through Monte-Carlo simulations.We compare the two methods by: i) analyzing the True Positive Rate and the True Negative Rate that provide information on the correctness of the reconstructed interaction graph, ii) studying the reconstruction error that gives information on the relative differences among the inferred and the true couplings, iii) analyzing how good the inferred couplings are in predicting the dynamics of the system, i. e., comparing the true 4-point correlations with those obtained from the dynamics generated with the reconstructed graph.Further, we proposed and verified a new halt criterion for the decimation procedure that allows to achieve better performance for high T and low M when the tilted PLF does not display a clear maximum.
Other interesting questions are still to be addressed.An interesting deep analysis, requiring a work on its own, concerns the robustness against non-equilibrium.Indeed, all the analysis done here is on thermalized data.Future work will analyze the under-sampling regime  and the issue of extracting extract useful insights from an under-sampled data set.

A Mode-Locking-like dilution of Erdös-Rényi graphs
Our interest in studying the XY and the complex spherical models resides in optics, with the aim of describing the dynamics of interacting electromagnetic modes in lasers.Previous mean-field studies on fully connected models assume narrow-band for the spectrum, [58,59,77,78] that is, all modes practically have the same frequency and, in this way, the frequencies do not play any role in the system behavior.In this section, we will show that non-fully connected graphs can indeed describe the effects of the frequency matching condition intrinsic in modelocking lasers once the existence of finite-band spectra and gain frequency profiles enter in the description.In particular, if the frequency distribution of the modes is known, we will see how it is possible to derive the remaining interacting quadruplets once the FMC is applied.
In general, we could consider diluted graphs obtained from fully-connected graphs with any kind of dilution.In [22,23], the authors compared a homogeneous dilution (HD), in which the quadruplets are eliminated with some probability that is independent on the mode properties, with a correlated dilution (CD), in which the remaining couplings are those among mode quadruplets satisfying the FMC.Firstly, they noticed how to impose the FMC is analogous to introduce a metrics in the problem.Consider the analogy with a random network: the way to construct a graph with a metric is to introduce a distance between different nodes, e.g.d ij = |i − j|, and to choose bonds with a probability depending on such a distance.In the case of four-body interactions, one needs a four index function that can be taken as the FMC.In this way, the mode frequencies play the role of coordinates.Indeed, in Refs.[22,23] it was quantitatively derived that the closest the modes are in frequency, the highest the probability to be neighbors of the same function node, i.e., to participate to the same quadruplet.
As a starting point, let us consider the case of a Fabry-Pérot cavity laser with a flat gain curve 4 .In this case, the longitudinal resonant frequencies are equispaced with dω = 2π/T R , being T R the cavity round trip.We indicate with N in quad the number of initial quadruplets.We will consider cases in which N in quad is smaller than the total number of possible quadruplets, N 4 , but the results here derived can be applied also starting from a fully connected graph.Considering the optical system we have in mind, a first dilution is related to the spatial overlap of the modes: in order to interact the modes have to compete for the same gain medium.For a Fabry-Pérot resonator we expect N in quad = N 4 , since all modes fill the entire cavity.For more complicated geometries we would expect N in quad < N 4 .In general we may have O N in quad < O(N 4 ).Starting from N in , the quadruplets whose mode frequencies do not satisfy the FMC are erased from the graph.We will term the resulting graphs "Mode-Locking" (ML) graphs.
Knowing the frequency distribution of the modes, we can determine the effect of the FMC on the expected final number of function nodes, N out quad .For example, for a multimode Fabry-Pérot cavity we expect a frequency comb spectrum: where ω 0 is some boundary frequency.We consider the realistic case of δω/ω 0 1, being in laser ω 0 in the visible light frequency range and δω in the radio-frequency range.We assume a flat gain curve and one mode for each frequency.For each of the N in quad we have to verify if the modes belonging to that quadruplet satisfy the FMC.Looking at Eq. ( 4), i.e., γ being the typical linewidth of the mode frequency.We notice that among the 24 possible ordering of the indices k 1 , k 2 , k 3 , k 4 in the above expression, can be grouped into 3 independent orderings with 8 equivalent permutations each.Let us consider one ordering among them, which will be indicated with the subindex 1.For example: where we have used Eq. ( 38) in the last step.In order to determine the probability for FMC 1 to be satisfied, P (FMC 1 ), we can evaluate first the probability distribution of n + ij ≡ n i +n j .Considering the case of uniformly distributed frequencies, i.e., P (n) = 1 N for n ∈ [1, N ], we have, with n + ∈ [2, 2N ]: Then, for the probability that left and right hand side of Eq. ( 39) be equal we obtain where we are considering the limit N 1.The same occurs for FMC 2,3 .Eventually, the number of quadruplets satisfying at least one FMC reads, for large N : Imposing the FMC dilutes a network of O(N ).From this result we learn that in order to obtain a ML diluted graph, with a number of links increasing with N α , we need to start with a denser graph whose number of links scales as N α+1 .For instance, in Fig. 22, we show the ML graph obtained starting from an Erdös Rényi graph with N = 8 • 10 3 and N in quad = 76.8 • 10 6 ∼ 1.2N 2 .After imposing the FMC we have 1.2N 2 .The connectivity distribution of this initial graph follows a Poissonian distribution.N end quad = 19219 2N in quad /N and the mean connectivity is c 9.6.We can see that modes with central frequencies have slightly higher connectivity values.The full white line is an histogram of the red data with N ∆ω/δb = 80, being N ∆ω the total frequency range and δb the bin size.The full black line, P (c), gives the probability distribution of the connectivities; it is plotted with the c values on the y-axes to visually enlighten the relation with the red data.N end quad = 2.4N .The dotted red line represents the connectivity as a function of frequency ω.To show more clearly the behavior in ω, the white full line shows an histogram of the red data.As explained above, before applying the FMC, the frequencies do not play any role, e.g., in fully connected models.On the other hand, in the ML graph, we see that the modes with central frequencies have on average higher values for the connectivity.We have analyzed several cases, with different degrees of dilution in the starting graphs and this frequency dependence of the connectivity has been repeatedly observed.In App.B, we analyze more in details this result deriving the connectivity distribution given the frequency of the node.
In Fig. 22, the black line shows the probability distribution of the connectivity, P (c).In Fig. 23 P (c) is plotted on top of a Poissonian distribution with parameter taken from the mean connectivity of the ML graph.We can see that P (c) coincides with the Poissonian distribution, as explained in the next App.B.

B Frequency dependence of connectivity in ML graphs
As anticipated in the previous Appendix, in this section we will show that, even after the FMC is imposed, the distribution of the node connectivities, P end (c(ω)), depending now on the frequency ω of the node, is described by a Poissonian distribution.If we start with an Erdös Rényi graph, like in Fig. 22, the distribution of the connectivities of the initial graph will also be Poissonian, with an average λ larger of a factor N with respect to the ML graph built from it, cf.Eq. (41).The parameter λ of the Poisson distribution is decreased by a factor equal to the probability that, given ω, at least one FMC is satisfied.We indicate the probability with P sat (ω); P unsat (ω) = 1 − P sat (ω) is, on the other hand, the probability that given ω no FMC is satisfied, i.e., that quadruplet is erased from the graph.As a first step, let us evaluate the probability that in a ML graph a node does not participate in any quadruplet, i.e., c(ω) = 0.As before, we indicate with P in (c) the probability distribution of the node connectivity before the FMC is applied.We take the ER Poisson case: with λ = c , independently of ω.For simplicity, we will omit to explicitly write the frequency dependence of P sat (ω) and P unsat (ω) P end (0) = P in (0) + P in (1)P unsat + P in (2)P 2 unsat + . . .

Figure 1 :Figure 2 :
Figure 1: Value of no-match parameter obtained for various values of λ

Figure 3 :
Figure 3: Variance σ 2 of the coupling estimator distribution for each quadruplet at various M for T = 2.2.The system is constituted by N = 16 nodes, hence a total of 1820 quadruplets are displayed, sorted by increasing value of estimated J.

Figure 4 :
Figure 4: Average σ 2 for various M as a fuction of T .For T < T c , σ 2 show a strong increase.

Figure 5 :Figure 6 :
Figure5: The TNR/TPR ratio vs. the regularizer λ for the 1 -regularization PLM (top) and vs. the fraction x of undecimated couplings for the decimation PLM (bottom).In the first case two different criteria are chosen to eliminate small bonds: with an a-priori threshold δ or by means of the a posteriori hypothesis testing procedure, indicated as L1 in the legend.Data are taken from the 4XY model on sparse, ER like graph, with N = 16, N q = 2N .Here, T c ∼ 1.34

Figure 7 :
Figure 7: Reconstruction error vs λ and x for the 4XY model on sparse, Erdos-Renyi-like, graph with N q = N = 16.

Figure 8 :
Figure 8: Reconstruction error versus T for sample size M=4096 (left) and versus M for temperature T=4.4 (right) for the 4XY model on sparse, Erdos-Renyi-like, graph with N q = N = 16.The error obtained following both the 1 -regularized PLMs and the decimation PLM are shown.

Figure 9 :
Figure 9: Comparison of the 4-point correlations computed in a Monte Carlo simulation of a system of N = 16 XY spins, N q = N with the original coupling network (x-axis, J true ) and with an inferred coupling network (y-axis, J inter ).Blue points are correlations measured on networks reconstructed by 1 -regularized PLM, red points corresponds to correlations on networks inferred by decimation PLM.

Figure 10 :
Figure 10: Histograms of the coupling constants J at zeroth step of decimation.Here the system is composed of N = 32 XY spin; the original network has N q = 192.Results are show at T = 3.3 for increasing sample size M = 8192, 16384 and 65536 (from left to right).

Figure 11 :
Figure 11: Histograms of the coupling constants J as obtained when decimation stops for the system of Fig. 10.

Figure 12 :
Figure 12: (Left) Histograms of the coupling constants J at zero step of decimation for the complex SM.The system is composed of N = 32 phasors and N q = 2360.T = 7.1 > T c .(Right) Histograms of the same system of Figure right at the stopping point of decimation.For M = 8192 the zero and non-zero distributions still overlap.

Figure 13 :
Figure 13: ∆ vs x for two different temperatures at low M (512) and at a high M (1024) in the 4-XY model.At low number of samples and high temperature, we cannot find a discontinuity in the ∆ value.

Figure 14 :
Figure 14: Left: Plot of x ∆ − x m (see text) vs T at different M for systems of N = 16 XY spins.Here the possible number of quadruplets is N q = 1820.Right : Plot of x M − x m .

Figure 15 :
Figure 15: Left/Right: Plot of x ∆ − x m /x M − x m vs T at different M for a system of N = 32 XY spin.

Figure 16 :
Figure 16: tPLF (green), ∆ (purple) and err J (blue) versus x.Left: whole range of x.Right: zoom in the range of x close to x m , i.e., the minimum of reconstruction error.

Figure 17 :
Figure 17: Reconstruction error obtained assuming a 2 − XY model hypothesis for a system of N = 16 XY spins and N 4 = 32 quadruplets on an ER-like graph (left) and with N 4 = 47 on a ML-like graph (right).

Figure 20 :
Figure 20: Reconstruction error obtained from 1 -regularization PLM through PLM with decimation for the systems of Fig. 19.

Figure 21 :
Figure 21: Inferred couplings as obtained from PLM with decimation for the systems of Fig. 19 sorted in ascending order.

Figure 22 :
Figure 22: The red dotted line shows the connectivity as a function of frequency of a ML graph.The number of nodes is N = 8 • 10 3 and the number of initial quadruplets is N in quad = 76.8 • 10 61.2N 2 .The connectivity distribution of this initial graph follows a Poissonian distribution.N end quad = 19219 2N in quad /N and the mean connectivity is c 9.6.We can see that modes with central frequencies have slightly higher connectivity values.The full white line is an histogram of the red data with N ∆ω/δb = 80, being N ∆ω the total frequency range and δb the bin size.The full black line, P (c), gives the probability distribution of the connectivities; it is plotted with the c values on the y-axes to visually enlighten the relation with the red data.
) from a ML graph PDF for λ = <c>

Figure 23 :
Figure 23: (b): Probability Density Function (PDF) of a Poissonian distribution with parameter λ = c , dotted blue line, on top of P (c), empty blue circles: the connectivity distribution of the ML graph is still not very far from the starting Poissonian distribution but with mean connectivity decreased by a factor 2/N .