SciPost Submission Page
Random features and polynomial rules
by Fabián Aguirre-López, Silvio Franz, Mauro Pastore
Submission summary
Authors (as registered SciPost users):  Mauro Pastore 
Submission information  

Preprint Link:  scipost_202407_00018v1 (pdf) 
Code repository:  https://github.com/MauroPastore/RandomFeatures 
Date submitted:  2024-07-10 16:51 
Submitted by:  Pastore, Mauro 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approaches:  Theoretical, Computational 
Abstract
Random features models play a distinguished role in the theory of deep learning, describing the behavior of neural networks close to their infinite-width limit. In this work, we present a thorough analysis of the generalization performance of random features models for generic supervised learning problems with Gaussian data. Our approach, built with tools from the statistical mechanics of disordered systems, maps the random features model to an equivalent polynomial model and allows us to plot average generalization curves as functions of the two main control parameters of the problem: the number of random features N and the size P of the training set, both assumed to scale as powers of the input dimension D. Our results extend the case of proportional scaling between N, P and D. They are in accordance with rigorous bounds known for certain particular learning tasks and are in quantitative agreement with numerical experiments performed over many orders of magnitude of N and P. We find good agreement also far from the asymptotic limits where D → ∞ and at least one of P/D^K, N/D^L remains finite.
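As an illustrative sketch of the model class described above (not code from the paper or its repository): a random features model has a fixed Gaussian first layer and a trained linear readout, with N, P and D as the control parameters. The linear target, ReLU nonlinearity and ridge penalty below are arbitrary choices made for the example; the paper studies a random polynomial teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, P = 50, 200, 400  # input dimension, number of random features, training set size

# Gaussian input data; a simple linear target stands in for the paper's
# random polynomial teacher (illustrative choice only)
X = rng.standard_normal((P, D))
beta = rng.standard_normal(D)
y = X @ beta / np.sqrt(D)

# Fixed (untrained) Gaussian first layer F with ReLU nonlinearity
F = rng.standard_normal((N, D))
def features(X):
    return np.maximum(X @ F.T / np.sqrt(D), 0.0)  # shape (samples, N)

# The only trained parameters: a ridge-regression readout over the features
lam = 1e-3
Phi = features(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

# Generalization error estimated on fresh Gaussian data
X_test = rng.standard_normal((2000, D))
y_test = X_test @ beta / np.sqrt(D)
test_err = np.mean((features(X_test) @ w - y_test) ** 2)
```

Sweeping N and P in this sketch (at fixed D) traces out the kind of generalization curves the paper computes analytically.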
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work.
- Detail a groundbreaking theoretical/experimental/computational discovery.
- Present a breakthrough on a previously identified and long-standing research stumbling block.
Current status:
Reports on this Submission
Strengths
- The authors provide a tight asymptotic characterization of the learning of Random Features Models (RFM) on a random polynomial target function, in various data/width/dimension regimes. They outline and identify data- (resp. width-) limited regimes where the RFM reduces to a kernel (resp. polynomial regression) method, as well as a nontrivial width ~ data regime, exhibiting in particular an interpolation peak phenomenon. The derivation relies on the replica method from statistical physics and several random-matrix-theory arguments and approximations. All steps are rather clearly justified, motivated, and discussed.
- The analytical findings are supported by convincing numerics.
- The consequences/takeaways of the analytical results are discussed, notably in terms of overfitting and expressive power.
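As a side illustration of the data-limited (kernel) regime mentioned above, not a requested change: for N large, the empirical random-features kernel concentrates around its population limit, which for a ReLU nonlinearity is the degree-1 arc-cosine kernel of Cho & Saul. A minimal numerical check, with all parameter values chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 20, 200_000  # N >> D to approach the kernel (width-unlimited) regime

x = rng.standard_normal(D)
y_vec = rng.standard_normal(D)

# Empirical random-features kernel with ReLU nonlinearity
W = rng.standard_normal((N, D))
relu = lambda t: np.maximum(t, 0.0)
k_emp = relu(W @ x) @ relu(W @ y_vec) / N

# Population limit: the degree-1 arc-cosine kernel (Cho & Saul)
nx, ny = np.linalg.norm(x), np.linalg.norm(y_vec)
theta = np.arccos(np.clip(x @ y_vec / (nx * ny), -1.0, 1.0))
k_pop = nx * ny / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

# k_emp and k_pop agree to within a few percent at this N
```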
Weaknesses
I am listing a few minor points, typos or questions in the "changes requested" section. Here, I am listing some of my main questions and concerns.
- l. 154 (definition of the teacher): The authors consider a random teacher function, which does not allow one to investigate simple and natural target functions such as x^2, Hermite polynomials or spherical harmonics. This prevents an in-depth connection and comparison to related results on kernel learning, e.g. [22]. How important is averaging over the teacher in the derivation? [57] (which I am aware is contemporaneous to the reviewed paper, with an arXiv version released after the first arXiv version of the present work), for example, seems able to accommodate deterministic targets, at least for the learnable polynomial space.
- In the otherwise rather complete related-works section, l. 122, maybe the works of [Zavatone-Veth and Pehlevan, 2024] and [Schroder et al., 2024] on deep structured RFMs, and that of [Defilippis, Loureiro, Misiakiewicz, 2024] on dimension-free characterizations of RFMs, could be included. I am aware some of these works are contemporaneous with or appeared after the release of the first arXiv version of the present work, though prior to the present submission/version, and would leave the decision to the authors and the editor.
- p. 10: I understood the discussion, but it should ideally be clarified. In particular, could the equivalent model be written in terms of Hermite polynomials instead of Wick products, to connect with related equivalent maps, e.g. [22]? Also, adding a short appendix explicitly showing how the features (26) admit the population covariance (24), if this is the case, would be helpful.
- Some technical approximations (l. 307, ll. 243-246, further elaborated in the questions under "requested changes") are merely stated without sufficient discussion. I have not fully understood how these statements are supported, or whether they are heuristic assumptions, and feel that further discussion is needed in these passages.
Report
I am overall in favor of acceptance. While I have not carefully gone through every reported technical step, the overall derivation seems scientifically sound. The question explored is of interest, and my concerns bear primarily on some aspects of the exposition of the results, although the overall quality of the writing is largely good and clear.
Requested changes
I am listing below a number of typos, comments, and minor questions.
- l. 115: "as long as with finite dimensional outputs".
- l. 187: missing "of".
- l. 201: slightly awkward phrasing; maybe "since x is a test point, and is thus uncorrelated with ..." would be simpler and clearer.
- l. 226: missing "e".
- ll. 243-246: Is there any (even non-rigorous) reason to expect the rank to be given by this minimum? Isn't it in full generality just an upper bound? More discussion would be helpful.
- Similarly, the statement that off-diagonal elements don't affect eigenvalues/vectors is a bit too fast, and further discussion would be helpful to support it.
- l. 343: incomplete sentence.
- l. 417: Instead of "overfitting the effective noise", isn't the model rather using the effective noise to overfit the teacher? The two phenomena are different.
- (55): Though I understand the approximation of neglecting diagonal terms, I am not sure I understand why the l-th Hadamard power of C can be thought of as Wishart. In particular, it seems the corresponding matrix involved in the product does not have Gaussian entries, and the entries are also not mutually independent. Perhaps more discussion would help.
- l. 307: Why are the row spaces assumed orthogonal? Again, more discussion would prove useful.
Recommendation
Publish (meets expectations and criteria for this Journal)