
SciPost Submission Page

How to escape atypical regions in the symmetric binary perceptron: a journey through connected-solutions states

by Damien Barbier

Submission summary

Authors (as registered SciPost users): Damien Barbier
Submission information
Preprint Link: https://arxiv.org/abs/2408.04479v2  (pdf)
Date submitted: 2024-08-23 13:49
Submitted by: Barbier, Damien
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Statistical and Soft Matter Physics
Approaches: Theoretical, Computational

Abstract

We study the binary symmetric perceptron model, and in particular its atypical solutions. While the solution space of this problem is dominated by isolated configurations, the problem remains algorithmically solvable for a certain range of constraint density $\alpha$ and threshold $\kappa$. In this paper we provide a statistical measure probing sequences of solutions in which two consecutive elements share a strong overlap. After simplifications, we test its predictions against Monte-Carlo simulations. We obtain good agreement and show that connected states with a Markovian correlation profile can fully decorrelate from their initialization only for $\kappa>\kappa_{\rm no-mem.\, state}$ ($\kappa_{\rm no-mem.\, state}\sim \sqrt{0.91\log(N)}$ for $\alpha=0.5$, with $N$ the dimension of the problem). For $\kappa<\kappa_{\rm no-mem.\, state}$, we show that decorrelated sequences still exist but have a non-trivial correlation profile. To study this regime we introduce an $Ansatz$ for the correlations that we label the nested Markov chain.
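To fix notation for readers unfamiliar with the model, here is a minimal sketch of the symmetric binary perceptron and of zero-temperature single-spin-flip dynamics inside its solution space. This is illustrative only: the Gaussian constraint matrix, the rejection rule and all parameter values are generic assumptions, not the exact protocol used in the paper.

```python
import numpy as np

def is_solution(x, G, kappa):
    """Symmetric binary perceptron: x in {-1,+1}^N is a solution iff
    |g_mu . x| / sqrt(N) <= kappa for every row g_mu of G."""
    return np.all(np.abs(G @ x) / np.sqrt(x.size) <= kappa)

def mc_step(x, G, kappa, rng):
    """One zero-temperature Monte-Carlo move: flip a random spin and
    keep the flip only if x remains a solution."""
    i = rng.integers(x.size)
    x[i] *= -1
    if not is_solution(x, G, kappa):
        x[i] *= -1  # reject: the flip left the solution space
    return x

# Tiny demo at constraint density alpha = M/N = 0.5. Here kappa is set
# just above the largest margin of a random reference configuration x0,
# so x0 is a solution by construction.
rng = np.random.default_rng(0)
N, M = 100, 50
G = rng.standard_normal((M, N))
x0 = rng.choice([-1, 1], size=N)
kappa = np.abs(G @ x0).max() / np.sqrt(N) + 0.1
x = x0.copy()
for _ in range(10 * N):
    mc_step(x, G, kappa, rng)
overlap = x @ x0 / N  # m = x . x0 / N, correlation with the start
```

The overlap $m = x \cdot x_0 / N$ computed at the end is the quantity whose decay from 1 is studied throughout the paper.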

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Awaiting resubmission

Reports on this Submission

Report #2 by Anonymous (Referee 1) on 2024-11-29 (Invited Report)

Report

In the present paper, the author studies atypical solutions in the binary symmetric perceptron model to understand why some solutions to the problem are algorithmically accessible, even if typical solutions are isolated and unreachable.
The question the author tries to answer is very relevant and still largely open.
The work seems scientifically solid, but the presentation is challenging to follow and should be improved.

The method proposed by the author to count atypical solutions is rather complicated as it requires the optimization of a potential over a large number of parameters.
So let me start with a naive question, which I believe is needed to understand why the author is doing such a challenging computation.
Given that atypical solutions reached by the Monte Carlo (MC) algorithm implemented by the author have a very specific distribution of margins P(w), would it be possible to count such atypical solutions and check whether they are less isolated than typical ones or even connected?
We know very well that biasing the measure towards atypical solutions can change the phase diagram of a constraint satisfaction problem and the corresponding algorithmic thresholds (see, e.g., Budzynski et al., J. Stat. Mech. (2019) 023302).
The author should comment on this kind of computation, both if it has already been done (in which case the results should be reported) and if it is still missing (in which case the author should explain why it was not done instead of building complicated chains of solutions).

Observing the data reported in Figure 4 (left), the author states that the MC dynamics require superlinear (in N) times to go below the correlation predicted within the no-memory ansatz (roughly 0.9). The author insists a lot on this point, repeating it many times throughout the paper. The problem is that the data shown in Figure 4 (left) do not support this claim: below the value 0.9, the decorrelation becomes faster (not slower) as the system size increases. Indeed, the largest size (N=40000, yellow curve) shows the fastest relaxation. The author should either better explain the claim or modify it throughout the paper, bringing it into agreement with the reported data.

Moreover, I don't believe that considering superlinear times is really needed to describe that regime of correlations. The author shows that the no-memory ansatz is unable to enter that regime, but probably, one should just consider solutions with a different type of correlation, which can still be reached in linear times. An illuminating example could be the one discussed in Ref. [33]: in the xorsat model, the simplest ansatz describes well the relaxation dynamics until an energy value close to 0.1 (see Fig. 6 of Ref. [33]); below that value, a more complicated ansatz is needed, but the times to reach those energy values are still linear in the system size N.
So, I think the author should decouple the breakdown of the validity of the no-memory ansatz from the appearance of superlinear times (or otherwise provide the evidence that the present version of the paper is lacking).

In Figure 6, when commenting on the distribution of margins P(w) in configurations reached by the MC, the author claims that the difference from the truncated Gaussian distribution extends over a distance $\sqrt{1-m^2}$ from the edge. It would be useful to show analytical evidence for such a claim.

One of the main results of the paper is presented in Figure 7, where the author shows that a nontrivial region of $\kappa$ values can be explored through delocalized no-memory chains. Given the importance of the result, I suggest the author present it also for other values of $\alpha$ and other choices of m. Restricting to $m=1-2/N$ seems very limited. Once the threshold is computed for several values of m, one could, in principle, optimize over this parameter to maximize the delocalized phase. If this is not doable or not useful, the author should explain why.

If the no-memory computation can be done with different m values (e.g., with $m=1-c/N$ for several values of c), I would appreciate a comparison of the MC data with the solution predicted by the no-memory ansatz with different m values. For example, the MC data shown in Figure 8 look like a distribution of margins in the no-memory ansatz but with a smaller value of m (i.e., a larger value of c). Making the comparison between MC data and the no-memory ansatz only with $m=1-2/N$ is very limited.

Section 6 of the manuscript is very hard to follow. The reader will probably get lost in very long formulas, and the messages the author would like to convey are at risk of being missed. I suggest the author reorganize the section by moving the technical details to an appendix and concentrating on explaining the main results.

In my opinion, section 6, reporting the results obtained with the memory ansatz, should provide evidence that the analytical prediction is closer to the MC results. Given that the analytical computation with this new ansatz is much more demanding, the author must provide convincing evidence that it is worth making this extra effort. At present, the only evidence that the memory ansatz is doing better than the no-memory ansatz comes from Table 1. Moreover, in that table, the result for k=1 (no-memory) is obtained, if I understand correctly, with m=0.98 and not with $m=1-2/N$. Can the k=1 threshold be optimized by changing the value of m?

The suggestion of comparing the memory ansatz with the no-memory ansatz optimized over the m parameter is further supported by the results reported in Figures 10 and 11.
In Figure 10, the decay of the correlation in the memory ansatz is slower than in the no-memory ansatz, but it still looks like an exponential decay with a larger m value.
In Figure 11, the P(w) computed in configurations reached through the memory ansatz is very close to the one of the no-memory ansatz and probably very far from the P(w) obtained via the MC. Data from MC simulations should be added to Figure 11 (at least in the right panel) to provide evidence about how good the proposed ansatz is in describing the actual MC dynamics.

I understand that the optimization of the parameters in the memory ansatz is complicated, but the author should provide some information about the optimal parameters obtained at the end of this process. This would be important to understand the differences with respect to the no-memory ansatz.

Many figures contain fonts that are unreadable when printed. This must be avoided.

Minor points:
- Between equations (4) and (5), the definition of margins lacks the right normalization.
- On page 18 the comment "(conversely m->0)" is obscure to me. "Conversely" to what?

Recommendation

Ask for minor revision

  • validity: good
  • significance: high
  • originality: high
  • clarity: low
  • formatting: good
  • grammar: good

Report #1 by Anonymous (Referee 2) on 2024-11-26 (Invited Report)

Report

In this manuscript, the author introduces a method to explore the solution set of the symmetric binary perceptron.
The formalism introduces a chain of solutions (anchored on a planted solution x_0), allowing the exploration of atypical connected regions of solutions.
Parts 1 to 4 introduce the chain formalism and explain the need for a simplification (the no-memory ansatz), which assumes that solutions at distance larger than 1 on the chain decorrelate.
Part 5 confronts the theoretical results obtained in the previous sections (within the no-memory ansatz) with numerical simulations.
Part 6 builds the next steps to go beyond the no-memory ansatz.
The paper is well written and presents strong results. It is scientifically solid and the conclusions are well supported. I recommend it for publication.

I have some comments/suggestions that I think could improve the comprehension of the manuscript:

- eqs (32) and (33): maybe you should give more explanations on these inequalities, in particular explain why we need to distinguish the cases K_t>K_{t-1} and K_t<=K_{t-1}.
For eq (33), if I understand correctly, the local entropy for m_1 < m <= 1 is positive because x_t solutions are less constrained (K_t>K_{t-1})?
But is there an intuitive explanation why there is still a range [m_0,m_1] for which V<0 ?
Also, when restricted to one case (e.g., K_t>K_{t-1}), how do the values m_0, m_1 change when varying \kappa_t, \kappa_{t-1}? Are there choices of K_t, K_{t-1} for which m_1=m_0? Do the values m_0, m_1 also depend on the previous choices of \kappa_j, j<t-1?

- eq (29): From line 1 to line 2, the term with the integral over w_{t-1} is removed; how do you justify this?

- also, I couldn't find in the main text a definition of the large-size limit that is usually taken when making the replica computation. At times it is a bit unclear which parameters are size-dependent (among the overlap constraint $m$, the margin $\kappa$ and the density of constraints $\alpha$) and which are constant. Since in section 4.2 the author justifies the choice of a non-trivial scaling for the magnetization $m$, I think it would be useful to first explain in a few words what the usual setting is.

- eqs (17, 18) and below: I might have missed the point, but I don't really agree with the justification for the iterative procedure fixing the m's and mhat's.
Even if x_1, x_2, ..., x_t are built recursively, it does not follow that optimizing V_t over all parameters {m_{j,j'}, mhat_{j,j'}} simultaneously is equivalent to optimizing them one by one through V_{t=1}, V_{t=2}, ..., especially since I do not see a clear recursive relation between V^*_{t} and V^*_{t-1}.
In other words, the optimal {m_{j,j'}, 0<=j'<j<=t} for a chain of length t might differ from the optimal m_{3,1} for a chain of length 3, from the optimal m_{4,1}, m_{4,2} for a chain of length 4, and so on.
Are you describing here an approximate method to obtain the optimized values? If so, I think you should mention it.

- still in the same paragraph, the sentence 'To put it more concretely, we set mhat_{1,0} and m_{1,0} by optimizing V^*_{t=1}. Then, we fix mhat_{2,0}, mhat_{2,1}, m_{2,0} and m_{2,1} by optimizing V^*_{t=2} (and so on and so forth).' But m_{1,0} and m_{2,1} are parameters of the potential V, so one should not optimize over them. Is it a typo?

- section 4.1: Since you have the full expression (17), did you check numerically (maybe only for t=2, t=3) that the no-memory ansatz is justified, by computing hat{m}_{j,j'} for |j-j'|>1?

- section 5.1: for a given run, might the mapping f: t_rescaled -> t also change along the run?
In other words, I think that flipping f(N) spins might take a different time depending on whether we are at the beginning or at the end of the Monte-Carlo run, so how do you justify taking a mapping that remains the same throughout the run?
In Figs. 3, 4 and 5, could you justify the choice made for the rescaled-time function f?
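To make the concern concrete, here is a toy measurement of the kind I have in mind (a Python sketch under assumed parameters, not the author's actual protocol): record the overlap with the initial configuration once per sweep of N attempted flips, so that comparing curves at a fixed sweep count for several N directly probes whether the relevant time scale is linear in N, without committing to any a priori mapping f.

```python
import numpy as np

def overlap_trace(N, alpha, kappa_margin, sweeps, seed=0):
    """Zero-temperature single-flip dynamics inside the solution space,
    started from a reference configuration x0; records m(t) = x . x0 / N
    once per sweep (N attempted flips). kappa is set kappa_margin above
    the largest margin of x0 so the run starts from a valid solution."""
    rng = np.random.default_rng(seed)
    M = int(alpha * N)
    G = rng.standard_normal((M, N))
    x0 = rng.choice([-1, 1], size=N)
    kappa = np.abs(G @ x0).max() / np.sqrt(N) + kappa_margin
    x = x0.copy()
    trace = [1.0]
    for _ in range(sweeps):
        for _ in range(N):
            i = rng.integers(N)
            x[i] *= -1
            if np.abs(G @ x).max() / np.sqrt(N) > kappa:
                x[i] *= -1  # reject moves that leave the solution space
        trace.append(x @ x0 / N)
    return np.array(trace)

m = overlap_trace(N=60, alpha=0.5, kappa_margin=0.5, sweeps=5)
```

Overlaying such traces for several N on a common sweep axis would show directly whether relaxation below a given overlap requires superlinear times.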

Other remarks:

- Figure 2: I suggest using a larger font for writing the means.

- In section 1.2, paragraph 'Solutions chain formalism', the sentence "However, when taking the limit m_{j,j−1}->1, it can be shown that an annealed evaluation of the entropy becomes exact."
It is not clear at first sight that this is actually shown in appendix A. Could you make the link with the appendix explicit, so the reader is not confused?

- eq (24): it might be useful to remind the reader that the definition of the function H is given in (10).

- section 5.3: when describing fig. 3 in the main text, you mention the right panel both times; I guess the second time (when talking about the plot with rescaled time) 'right' should be replaced by 'left'.

- section 5.1, typo: the matching function is called g : t_{rescaled} -> t in the main text, but it is called f in the figures.

Recommendation

Publish (surpasses expectations and criteria for this Journal; among top 10%)

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -
