
SciPost Submission Page

Goodness of fit by Neyman–Pearson testing

by Gaia Grosso, Marco Letizia, Maurizio Pierini, Andrea Wulzer

This is not the latest submitted version.

Submission summary

Authors (as registered SciPost users): Marco Letizia
Submission information
Preprint Link: scipost_202307_00009v1  (pdf)
Code repository: https://github.com/GaiaGrosso/NPLM-GOF
Data repository: https://zenodo.org/record/7128223
Date submitted: 2023-07-04 19:24
Submitted by: Letizia, Marco
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

The Neyman--Pearson strategy for hypothesis testing can be employed for goodness of fit if the alternative hypothesis is generic enough not to introduce a significant bias while at the same time avoiding overfitting. A practical implementation of this idea (dubbed NPLM) has been developed in the context of high energy physics, targeting the detection in collider data of new physical effects not foreseen by the Standard Model. In this paper we initiate a comparison of this methodology with other approaches to goodness of fit, and in particular with classifier-based strategies that share strong similarities with NPLM. NPLM emerges from our comparison as more sensitive to small departures of the data from the expected distribution and not biased towards detecting specific types of anomalies while being blind to others. These features make it more suited for agnostic searches for new physics at collider experiments. Its deployment in other contexts should be investigated.
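As a schematic illustration of the strategy summarised in the abstract (the notation here is illustrative only; the precise definitions are given in the paper), the Neyman--Pearson test statistic is twice the log ratio of the best-fit alternative and reference likelihoods,

t(D) = 2 \log \frac{\mathcal{L}(H_{\widehat{W}} \mid D)}{\mathcal{L}(R \mid D)} = 2 \max_{W} \log \frac{\mathcal{L}(H_{W} \mid D)}{\mathcal{L}(R \mid D)},

where the family of alternatives H_W is parametrised by a flexible model (in NPLM, functions f_W implemented by a neural network parametrise the log ratio of the H_W and R likelihoods) and the p-value is obtained by calibrating t on datasets generated under the reference hypothesis R.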

Current status:
Has been resubmitted

Reports on this Submission

Anonymous Report 2 on 2024-2-23 (Contributed Report)

Strengths

1-Wide breadth
2-Interesting and highly up to date topic

Weaknesses

1-Lack in depth
2-Writing needs improvement

Report

The Neyman-Pearson lemma provides a method for determining whether a hypothesis test possesses the highest statistical power, and the NP test is a means of constructing such a test. This lemma is a cornerstone in the field of hypothesis testing. While its application in high-energy physics is not novel, this article surveys a new approach that treats it as a classification problem, thus allowing for the utilization of machine learning techniques. The proof of concept is demonstrated across a wide array of methods and signals.
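For readers unfamiliar with the classifier-based reading of goodness of fit, the following is a minimal, self-contained sketch; it is an illustration only and not the paper's NPLM implementation (the scikit-learn classifier, the Gaussian toy samples and the use of held-out accuracy as the test statistic are choices made for this example).

```python
# Sketch of a classifier-based two-sample GoF test: train a classifier to
# distinguish the observed data D from a reference sample R, use a
# classification metric (here: held-out accuracy) as the test statistic,
# and calibrate its null distribution with reference-only pseudo-experiments.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def two_sample_accuracy(sample_a, sample_b, seed=0):
    """Held-out accuracy of a classifier trained to tell sample_a from sample_b."""
    X = np.vstack([sample_a, sample_b])
    y = np.concatenate([np.zeros(len(sample_a)), np.ones(len(sample_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(2000, 1))  # stands in for the R sample
data = rng.normal(0.3, 1.0, size=(500, 1))        # stands in for the observed D

t_obs = two_sample_accuracy(reference, data)

# Null distribution: repeat the test with R-distributed pseudo-data ("toys")
# of the same size, so the class imbalance is identical under the null.
toys = [two_sample_accuracy(reference, rng.normal(0.0, 1.0, size=(500, 1)), seed=k)
        for k in range(50)]
p_value = np.mean(np.array(toys) >= t_obs)
print(f"held-out accuracy = {t_obs:.3f}, p-value ~ {p_value:.2f}")
```

NPLM instead builds its test statistic from the maximum likelihood ratio evaluated on the trained model (see the paper for the precise construction); the choice of test statistic is one of the differences the paper investigates.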

One limitation of the article lies in the practicality of not knowing the true distributions. While this approach may reduce overfitting, it could potentially introduce higher bias.

Another limitation is that the included code appears to serve merely as examples rather than being packaged for reusable purposes.

Nevertheless, the article covers a wide range of applications in a compelling area of study. However, before recommending it for publication, attention should be given to addressing recurring issues such as
- vagueness in details
- incorrect word usage
- overly complex sentence structures, which hinder comprehension.

This list is not exhaustive, but here are some of the encountered issues.

Requested changes

1-Abstract - "alternative hypothesis is generic enough not to introduce a significant bias while at the same time avoiding overfitting" - what does `generic` mean here? This sentence can also be improved.
2-Abstract - Avoid using abbreviation in the abstract. If this is used multiple times, probably define it here.
3-Abstract - "and in particular" -> "in particular"
4-Abstract - "emerges from our comparison as more sensitive" -> as the more sensitive test
5-Abstract - The same sentence is very vague and can be improved. How "small departure from data"? What "specific types of anomalies"? What "others"?
6-Abstract - "Its deployment in other contexts should be investigated." Same problem here. What other contexts?

7-Sentence 2 - "general" -> "common/pervasive/recurring/ubiquitous/universal..."
8-Second paragraph - "prime" -> "primary", "if" -> "whether", "prediction for" -> "prediction of". This sentence is also too long.
9-Next sentence - "supplement or replace" -> "supplements or replaces"
10-Same sentence - "daily emerges as a necessity" -> omit "daily"
11-"Studying..." - "need" -> "motive/necessity/requirement..."
12-"such as for instance" -> "such as"
13-"see [1] ..." I would include more reviews here.
14-"... obtained by simulation codes..." -> obtained through simulation.
15-"...tell apart" -> distinguish
16-"... Kolmogorov–Smirnov, or others" - what others?
17-Figure 1 - please make the graph text legible. Label them 1a and 1b and refer as such in the text. The caption should include more details. This applies to other figures as well.
18-p4 - "...according to the standard Gaussian" - "distribution is Gaussian" or "Gaussian distribution"
19-p5 - "Employing expressive models while avoiding overfitting is the standard problem of machine learning research" - Incomplete sentence.
20-p5 - "... and it employs a Learning Machine." - Do you mean it employs machine learning algorithms here? A Learning Machine does not really mean anything.
21-p6 - "NPLM is found to perform better" -> "NPLM outperforms..."
22-p6 - "The first one is that ..." -> First, ....
23-p6 - "Training-test splitting is instead more..." -> "In contrast, training-test splitting is ..."
24-p6 - "as opposite to " -> "as opposed to ..."
25-p6 - "exploiting its prior knowledge." -> "utilizing prior knowledge"
26-p6 - "We consider a “good” GoF method ..." -> "We define a "good" ..."
27-p6 - "a more uniform response is observed ..." -> "... exhibits a more uniform response ..."
28-p10 - " For each toy," -> "For each instance"
29-p10 - "test the method performances to detect" -> "evaluate the method ..."
30-p10 - "are of course still drawn ..." -> "are, of course, still drawn ..."
31-p10 - "it is worth stressing " -> "it is important to emphasize..."
32-p20 - "We presented a first assessment..." -> "We presented/provided an initial assessment"
33-p20 - "by a direct comparision" -> "through a direct comparison"
34-p20 - "by an assessment of the impact" -> "by evaluating the impact"
35-p20 - "Globally better performance ..." -> Does "global" mean "overall" here?
36-p21 - "One bias that definitely ..." -> "One bias that certainly"
37-p21 - "Since our benchmarks are not " -> "Given our benchmarks are not ..."
38-p21 - "NPLM is instead probably not suited ..." -> "NPLM is likely not suitable ..."
39-p21 - "sensitivity loss of certain methods ..." -> "loss of sensitivity in certain methods"

  • validity: high
  • significance: high
  • originality: ok
  • clarity: ok
  • formatting: acceptable
  • grammar: below threshold

Anonymous Report 1 on 2024-1-16 (Invited Report)

Strengths

1- systematic, comprehensive (for particle physics) study
2- the object of study, NPLM, is a powerful new paradigm for GoF tests
3- overall very clear and easy to follow

Weaknesses

1- needs quite some more work on the language (listed under "Requested changes")

Report

The NPLM class of algorithms introduced a powerful new paradigm for goodness-of-fit (GoF) and two-sample (2ST) tests. This paper is the first to systematically compare it with classical GoF methods, in realistic, physically meaningful settings.
The results show the immense power -- in the statistics sense of the word -- of the NPLM methods. More specifically, NPLM seems to be a particularly robust choice for a GoF test, as it performs superbly in almost all scenarios studied.
The paper thus clearly fulfills SciPost's acceptance criteria. However, quite a few issues with language need to be resolved first. Also, I have a few requests to further elaborate on some explanations, as some details are not entirely clear to me yet.

Requested changes

1- abstract, line 6: initiate -> conduct
2- Introduction, line 1 (and throughout the text, will not repeat): samplings of a random variable -> samples of a random variable
3- p2 Classifier-based goodness of fit, line 5: "in most applications" feels too strong to me. -> "in many applications"
4- p2 Classifier-based goodness of fit, "and it is created within the design of the test methodology": I have no idea what that means.
5- one line below "by running more simulations" -> "by synthesizing more artificial data" (or similar)
6- p3 "Interpreting the GoF problem as a one of 2ST" -> "Interpreting GoF as a 2ST"
7- sometimes references are given as "[x]" sometimes "Ref. [x]". Does the journal propose/wish for a consistent manner?
8- p3 "to tell apart the D from the R data" -> "to tell D from R"
9- "On the contrary, it will possess" -> "It will however posses ... "
10- next sentence: "The performances of the trained classifier" -> "The performance of the trained classifier"
11- end of paragraph: "In computer science instead," -> "In computer science however," (or even just drop the instead)
12- Neyman-Pearson testing: "A third distinct statistical problem is the one of", "is the one of" sounds wrong to me. -> Hypothesis testing as formulated by Neyman and Pearson poses a third distinct statistical problem.
13- "by associating to the data D some probabilistic quantity" -> "by assigning some probabilistic measure to the data
14- "which is instead absent" -> "which is however absent"
15- throughout the paper: "test statistic(s) variable" -> "test statistic". drop the "variable".
16- p3 "which in turn defines the p[D] association" -> "which in turn defines the p-value"
17- p3 next sentence: "errors rate" -> "error rates"
18- p3 next sentence "in comparison with the aleternative H1" (fix the agreement)
19- p3 "inside the family" -> "within the family"
20- p4 second paragraph: "On the contrary, the best fit ..." -> "However/On the other hand, the best fit ..."
21- actually I think it would have been slightly easier to read, had you given the description of D1 and D2 earlier on, not when describing the results.
22- p5 line6 "regardless of whether the true distribution is R or it is different" -> "regardless of whether the true data distribution is or is not R"
23- p5 next paragraph "based on literature from the 1920s"
24- p5 same paragraph "the numbers of counting in each bin" -> "the counts in each bin". Also in the remaining text, "[number of] counting(s)" -> "count(s)", please check.
25- p5 next sentence "with arbitrary expected" -> "with arbitrary expected values" or "with arbitrary expectations". Please also check the remaining text for the same issue.
26- p5 In particular it is *often* used in high energy physics ....
the choice of binning is problematic and *subject* to the same type ...
27- p5 NPLM paragraph: In particular the functions f_W are conveniently taken to parametrise the log ratio of the H_W and R *likelihoods*
28- p5 last paragraph "including toy or realistic problems of new physics detection" -> "including toy as well as realistic new physics searches"
29- p5 last paragraph. "However, no systematic comparison has yet been performed"
30-p5 last sentence: "Such a comparison is the goal" (Drop the "initiating", or replace with "Conducting")
31- p6 "with which it shares such strong similarities", I would say either features/characteristics are shared, or similarities are exhibited, but maybe that's only a minor semantic detail.
32- p6 same paragraph "that have been considered for the comparison" drop "the". Rephrase the second part "and not exposed to strong ... " (I propose to start a new sentence).
33- next paragraph "and based on the following logic" -> ", based on the"
34- same paragraph: ", a quantity that has no *direct* (intrinsic, immediate) interpretation as a classification metric"
35- p6 third paragraph, the comparison with 1d-gof does not feel out of line with the rest of the paper to me, jfyi
36- next paragraph you "draw conclusions on the implications on the merits of NPLM", wait, what? What are you guys, Germans? Can you say this in simpler words? Also: "in the landscape" -> "within the landscape" (depends though on how you reformulate)
37- same pargraph "risks to depend strongly" -> "may strongly depend"
"and *a* few new ones"
" as opposite to" -> "as opposed to"
"to anomalies of different type*s*"
38- 2.1 "The alternative hypothesis H_w ... is constructed/defined (not postulated) as".
39- p7 middle of the page "The expected of this Poisson distribution" -> "The expected (expectation) *value* of this Poisson distribution"
40- you write eq. (1), but Figure 1, I am sure SciPost has some recommendation on this.
41- p7 "and in turn on" -> "and thus on"
42- p7 "to deal with the setup where" -> "to deal with cases of fixed total numbers of data points" (or sth similar). "When dealing with this case" -> "In these cases, one simply replaces ..."
43- p7 "the log ratio between the distribution" -> "the log ratio of the likelihoods of ... "
44- p8 first sentence "test statistics variable" -> "test statistic" (and a few more occurrences later on); "that is twice the negative logarithm of the ratio of the R and H_hat(W) likelihoods". (Formulating it as -2 ln H0/H1 feels slightly more standard to me than 2 ln H1/H0, but that might be just me; eq. 5 would change too, though.)
45- p8 like for any test of hypothesis -> like for any hypothesis test
46- p8 logisitc -> logistic
47- p8 "to tell apart D from R" -> "to tell D from R"
48- p8 "to the discussion of the NPLM peculiarity" -> "to the discussion of NPLM's peculiarities"
49- p10 making *it* increasingly difficult to detect
50- Figure 2 (left): maybe drop the first "gauss" in the labels? increase font.
51- Figure 3: a lot of new acronyms appear in this plot, MLR = maximum likelihood ratio? BAL = balanced? ACC = ? I think some of them can simply be dropped.
52- p11 by running *the* training on *a* few R-distributed ...
53- p12 "hence the the total number of" drop one "the"
54- p13 We use *the* Adam optimiser
55- p14 the presence of *a* few signal events
56- p14 "In the non-discrepant regions instead" -> "However, in the non-discrepant"
57- p15 "that weights the same all ..." -> "that weights all ... the same"
58- p16 footnote 5 "to give the same performances" -> "to result in a comparable performance"
59- Figure 7: how does the ndf=96 come about? I guess the computation is 5(input)x20(hidden) - 4, why 4? Maybe explain in a short sentence.
60- p17 "for the NPLM method*'s* sensitivity"
61- p17 This is of course equivalent to *running* ...
62- Figure 7: what's up with EFT? Why is the power always at 100%?
63- p17 "all what there is to be learned" -> "all there is to be learned"
64- throughout: i think i would replace the word "habitual" by the words "default" / "standard"
65- Paragraph 3.2: Ha, now I am confused. So far I had thought the process was that you learn the (maximum-)likelihood-ratio with the NNs as alternative hypotheses, you use softmax/logistic_function to translate -2 ln H0/H1 into a classifier, then use e.g. the power of the test to evaluate the performance. It is the likelihood ratio that (or so I thought) you called the test statistic. Now you are talking about replacing the likelihood ratio with e.g. the classification accuracy? Or is it still the likelihood ratio principle, but you employ the classification metrics on top? On p18, you write, "We evaluate its impact on the performances ... that employ different test statistics variables to be evaluated on the trained classifier". For me it was the exact opposite: the classifier is a function of the test statistic, not the other way round? I no longer know what is called a test statistic, what is a classifier, and what is a classification metric. Please help me out here.
I am also unsure here as to how wide a class you call "NPLM". I thought that the defining feature of NPLM was the likelihood-ratio test with an "empirical" model (e.g. an NN) as the alternative, but now I am no longer sure.
66- p19 "display in the first place a considerable dependence", what's the meaning of "in the first place"?
67- p20 "to tell apart the reference hypothesis from one ... " -> drop "apart"
68- p21 with new physics at collider*s*
69- p21 the reference and the alternative distribution*s* are on the same support and the discrepanc*ies* emerge either ...
70- p21 footnote 6: This because -> This *is* because
71- p21 "Challenges stem from ... or that it can be meaningfully stated ..." can you please make this sentence more readable and less colloquial?
72- appendix a, p22 "Therefore, it must be justified to the extent that possible taking into account that" not English, please rephrase
73- p22 "is which level of sensitivity it is legitimate to" drop the "it"
74- p22 "is so close to the reference distribution not to produce" -> "close enough for it not to produce" (maybe there is a better way to say it)
75- p22 "according to some notion of “easiness” that has to be defined." -> does it have to be defined, though? according to some difficult to define notion of "easiness"
76- next sentence, such *a* notion
77- p22 For instance, sharp peaks (drop the s in sharps)
78- p22 "more abstract notion is needed in *order* to compare"
79- p23 "from the powerful result known as the Neyman–Pearson lemma" -> from the powerful Neyman-Pearson lemma, which identif*ies* ... that is the most sensitive, defined as the most powerful at any given test size
80- p23 The sensitivity of such *an* optimal
81- p23 any test of hypothesis -> any such hypothesis test
82- p23 should be a one that responds -> drop the "a"
83- p23 "Such type*s* of distributions"
84- p23 "and an area of 10 events" -> "and a total of 10 events". This information is actually given twice, see the following sentence.
85- p23 is not a discriminant variable -> is not a discriminating variable
86- p24 bottom, maybe refer the reader to the right reference for an explanation of M, sigma, lambda
87- p25 an agnostic search, what, the search doesn't believe in god? -> model-agnostic
88- p25 second paragraph The data sets .. consist .., produced in the final state*s*
89- p25 "The SM theory also predicts the total number of events with two muons that are expected to be observed after a certain number of protons have been collided. More precisely, the number of expected events is proportional to the integrated luminosity of the LHC collider that is employed in the analysis. By varying the data luminosity that we decide to employ, we can thus control the total number of expected events N(R) of our GoF case study." This seems an unnecessarily complicated way of saying things; please rephrase.
90- p25 "a 60 GeV lower cut selection" - "cut" and "selection" are redundant here; a cut is a selection. I propose that in this paragraph you always write variable-operator-value for the cuts, e.g. pT > 60 GeV, etc. This might make it easier to read.
91- p25 "are physical law not" -> "are phenomena not ... "
92- p25 "but by some BSM theory" -> but by an arbitrary extension
93- p25 "Simple BSM benchmark theories" -> "Simple BSM benchmark points"
94- p26 "in this case because the distribution of the reference and of the alternative hypotheses is not available in closed form." -> "... because the distributions ... are ..."
95- Figure 9, make the cross in the legend the same as the cross in the plot. Perhaps plot the meridian to guide the reader's eye. If you don't comment on the funky "error bars" in the caption, then please refer to the text.
96- p27 "This prevents in practice to estimate the sensitivity" -> "This hinders a sensible estimate of ... " or sth else less colloquial.
97- p27 "and the results consists for of a lower bound and a rough estimate based on Gaussian extrapolation as previously described." not English, please rephrase.
98- p27 "weight clipping value set at 2.15." -> "weight clipping set to 2.15"
99- p28 "is reduce to" -> "is reduced to", but harmonise singular/plural in the sentence (the voltages ... are reduced to their nominal values )
100- Eq. 21, a square is missing in the numerator
101- p29 "free parameter of the of the test." -> drop "of the"
102- p29 "distribuyion" -> "distribution"
103- p29 "counting number of instance*s*"
104- p29 "but rather to the integral of the number" -> drop the "to"
105- p30 "These tests are constructed by first transporting the data points" -> use "mapping" instead of "transporting". "uniformly distributed *under* the R hypothesis". The *map(ping)* is defined ...
106- Figures 10, 13: what's the meaning of IN and OUT?

  • validity: high
  • significance: high
  • originality: high
  • clarity: high
  • formatting: excellent
  • grammar: acceptable
