SciPost Submission Page

Random features and polynomial rules

by Fabián Aguirre-López, Silvio Franz, Mauro Pastore

Submission summary

Authors (as registered SciPost users): Mauro Pastore
Submission information
Preprint Link: scipost_202407_00018v1  (pdf)
Code repository: https://github.com/MauroPastore/RandomFeatures
Date submitted: 2024-07-10 16:51
Submitted by: Pastore, Mauro
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Statistical and Soft Matter Physics
Approaches: Theoretical, Computational

Abstract

Random features models play a distinguished role in the theory of deep learning, describing the behavior of neural networks close to their infinite-width limit. In this work, we present a thorough analysis of the generalization performance of random features models for generic supervised learning problems with Gaussian data. Our approach, built with tools from the statistical mechanics of disordered systems, maps the random features model to an equivalent polynomial model, and allows us to plot average generalization curves as functions of the two main control parameters of the problem: the number of random features N and the size P of the training set, both assumed to scale as powers of the input dimension D. Our results extend the case of proportional scaling between N, P and D. They are in accordance with rigorous bounds known for certain particular learning tasks and are in quantitative agreement with numerical experiments performed over many orders of magnitude of N and P. We find good agreement also far from the asymptotic limits where D → ∞ and at least one of P/D^K, N/D^L remains finite.
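
As a rough illustration of the setup described in the abstract, the sketch below fits a ridge-regularized readout on N frozen random features using P Gaussian training inputs in dimension D, and returns the test error. It is not the authors' code (that lives in the linked repository). The `tanh` activation, the toy linear-plus-quadratic teacher, the ridge value, and every function and variable name are assumptions made only for this example.

```python
import numpy as np


def rf_test_error(D, N, P, ridge=1e-3, n_test=1000, seed=0):
    """Test MSE of ridge regression on N random tanh features with P Gaussian samples."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((N, D))   # frozen random first layer (hypothetical choice)
    w1 = rng.standard_normal(D)       # linear part of a toy random teacher
    W2 = rng.standard_normal((D, D))  # quadratic part of a toy random teacher

    def teacher(X):
        lin = X @ w1 / np.sqrt(D)
        # subtract the trace so the quadratic term has zero mean under Gaussian inputs
        quad = (np.einsum("pi,ij,pj->p", X, W2, X) - np.trace(W2)) / D
        return lin + quad

    def features(X):
        return np.tanh(X @ F.T / np.sqrt(D))

    X_train = rng.standard_normal((P, D))
    X_test = rng.standard_normal((n_test, D))
    Phi = features(X_train)           # shape (P, N)
    # only the readout weights are trained, via the ridge normal equations
    a = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(N), Phi.T @ teacher(X_train))
    return float(np.mean((features(X_test) @ a - teacher(X_test)) ** 2))
```

Calling `rf_test_error(D=20, N=N, P=50)` for a range of `N` values gives a single-sample version of a generalization curve; averaging over seeds would be needed before comparing with any theoretical prediction.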

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work.
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
In refereeing

Reports on this Submission

Anonymous Report 1 on 2024-7-22 (Invited Report)

Strengths

- The authors provide a tight asymptotic characterization of the learning of Random Features Models (RFM) on a random polynomial target function, in various data/width/dimension regimes.

- They outline and identify data (resp. width) limited regimes where the RFM reduces to a kernel (resp. polynomial regression) method, as well as a non-trivial width~data regime, exhibiting in particular an interpolation peak phenomenon.

- The derivation relies on the replica method from statistical physics and several random matrix theory arguments and approximations. All steps are rather clearly justified, motivated, and discussed.

- The analytical findings are supported by convincing numerics.

- The consequences/takeaways of the analytical results are discussed, notably in terms of overfitting and expressive power.

Weaknesses

A few minor points, typos, and questions are listed in the "Requested changes" section. Here, I list my main questions and concerns.

- l. 154 (definition of the teacher). The authors consider a random teacher function, which does not allow one to investigate simple and natural target functions such as ||x||^2, Hermite polynomials or spherical harmonics. This prevents in-depth connection and comparison to related results on kernel learning, e.g. [22]. How important is averaging over the teacher in the derivation? For example, [57] (which I am aware is contemporaneous with the reviewed paper, and was released on arXiv after the first arXiv version of the present work) seems able to accommodate deterministic targets, at least for the learnable polynomial space.

- In the otherwise rather complete related works section, l.122, maybe the works of [Zavatone-Veth and Pehlevan, 2024] and [Schroder et al, 2024] on deep structured RFMs, and that of [Defilippis, Loureiro, Misiakiewicz 2024] on dimension-free characterizations of RFMs could be included. I am aware some of these works are contemporaneous or appeared after the release of the first arXiv version of the present work, though prior to the present submission/version, and would leave the decision to the authors and the editor.

- p. 10: I understood the discussion, but it should ideally be clarified. In particular, could the equivalent model be written in terms of Hermite polynomials instead of Wick products, to connect with related equivalent maps, e.g. [22]? Also, adding a short appendix explicitly showing how the features (26) admit population covariance (24), if this is the case, would be helpful.

- Some technical approximations (l. 307, l. 243-246, further elaborated in the questions under "Requested changes") are merely stated without sufficient discussion. I have not fully understood how these statements are supported, or whether they are heuristic assumptions, and I feel that further discussion is needed in these passages.

Report

I am overall in favor of acceptance. While I have not carefully gone through every technical step, the overall derivation seems scientifically sound. The question explored is of interest, and my concerns are primarily about some aspects of the exposition of the results, although the overall quality of the writing is good and clear.

Requested changes

I am listing below a number of typos, comments, and minor questions.

- l. 115 "as long as with finite dimensional outputs"
- l. 187 missing "of"
- l. 201 Slightly awkward phrasing; maybe "since x is a test point, and is thus uncorrelated with ...." would be simpler and clearer.
- l. 226 missing "e"
- l. 243-246 Is there any (even non-rigorous) reason to expect the rank to be given by this minimum? In full generality, isn't it just an upper bound? More discussion would be helpful.
- Similarly, the statement that off-diagonal elements don't affect eigenvalues/eigenvectors is a bit too fast, and further discussion would be helpful to support it.
- l. 343 incomplete sentence.
- l. 417 Instead of "overfitting the effective noise", isn't the model rather using the effective noise to overfit the teacher? The two phenomena are different.
- Eq. (55) Although I understand the approximation of neglecting diagonal terms, I am not sure I understand why the l-th Hadamard power of C can be thought of as Wishart. In particular, it seems that the corresponding matrix involved in the product doesn't have Gaussian entries, and that its entries are not mutually independent. Perhaps more discussion would help.
- l. 307 Why are the row spaces assumed orthogonal? Again, more discussion would prove useful.
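
To make the question on (55) concrete, here is a minimal numerical sketch. It assumes C = F F^T / D with F an N x D matrix of i.i.d. Gaussian entries; this is an assumption of the sketch, and the manuscript's definition of C may differ. Under that assumption, the l-th Hadamard power of C is exactly the Gram matrix of the l-fold tensor powers of the rows of F. The factor matrix therefore has entries that are products of Gaussians, and these entries share underlying variables. All names in the code are hypothetical.

```python
import itertools
import math
import random


def gram(rows, scale):
    """Gram matrix of a list of vectors, divided by `scale`."""
    return [[sum(a * b for a, b in zip(u, v)) / scale for v in rows] for u in rows]


def tensor_power(v, ell):
    """All products v[i1] * ... * v[i_ell], flattened to a vector of length len(v)**ell."""
    return [math.prod(c) for c in itertools.product(v, repeat=ell)]


def hadamard_power_check(N=5, D=3, ell=2, seed=0):
    """Largest entrywise gap between C**ell (entrywise) and the lifted Gram matrix."""
    rnd = random.Random(seed)
    F = [[rnd.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
    C = gram(F, D)                                   # assumed form C = F F^T / D
    hadamard = [[c ** ell for c in row] for row in C]
    lifted = gram([tensor_power(f, ell) for f in F], D ** ell)
    return max(abs(hadamard[i][j] - lifted[i][j]) for i in range(N) for j in range(N))
```

The returned gap should sit at floating-point rounding level, since the identity (f_i . f_j)^l = <f_i^{⊗l}, f_j^{⊗l}> is exact. The sketch only exhibits the factorization; it does not test whether a Wishart approximation is accurate in the regimes considered in the paper.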

Recommendation

Publish (meets expectations and criteria for this Journal)

  • validity: high
  • significance: good
  • originality: good
  • clarity: good
  • formatting: excellent
  • grammar: good
