SciPost Submission Page
Trials Factor for Semi-Supervised NN Classifiers in Searches for Narrow Resonances at the LHC
by Benjamin Lieberman, Andreas Crivellin, Salah-Eddine Dahbi, Finn Stevenson, Nidhi Tripathi, Mukesh Kumar, Bruce Mellado
This is not the latest submitted version.
Submission summary
Authors (as registered SciPost users):  Benjamin Lieberman 
Submission information  

Preprint Link:  https://arxiv.org/abs/2404.07822v2 (pdf) 
Date submitted:  2024-06-28 11:15 
Submitted by:  Lieberman, Benjamin 
Submitted to:  SciPost Physics Core 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approaches:  Computational, Phenomenological 
Abstract
To mitigate the model dependencies of searches for new narrow resonances at the Large Hadron Collider (LHC), semi-supervised Neural Networks (NNs) can be used. Unlike fully supervised classifiers, these models introduce an additional look-elsewhere effect in the process of optimising thresholds on the response distribution. We perform a frequentist study to quantify this effect, in the form of a trials factor. As an example, we consider simulated $Z\gamma$ data to perform narrow resonance searches using semi-supervised NN classifiers. The results from this analysis provide substantiation that the look-elsewhere effect induced by the semi-supervised NN is under control.
Author comments upon resubmission
We sincerely thank the referees for their thorough and thoughtful review of our paper. Their insights are invaluable to us, and we have addressed each of their comments in detail in the following response.
Kind Regards,
Benjamin Lieberman and authors
List of changes
Response to Referee report 1 Comments:
Comment 1: The introduction of the article could benefit from a discussion on experimental new physics searches utilising unsupervised machine learning methods, as well as a more comprehensive explanation of the distinctions/similarities between the proposed semi-supervised technique and other (semi-supervised) methods used for anomaly detection. The conclusions should be rewritten to reflect these considerations.
Furthermore, clarification is needed regarding the assertion that the proposed method is less model-dependent than other methods, especially semi-supervised or unsupervised ones.
Response 1:
We have improved the introductory paragraphs introducing the proposed semi-supervised technique. This includes the introduction and comparison of unsupervised, semi-supervised and weakly supervised methods, with corresponding added references. Furthermore, we have added a discussion comparing the extent of model dependence. Although a quantitative comparison of the model dependence of the proposed method with alternative semi- or weakly supervised methods would make a valuable study, the outcome depends strongly on the specific signal, background and region of interest, and it is not the focus of this work.
Comment 2: The manuscript illustrates its methodology through resonance searches in the 𝑍𝛾 final state. While the choice of this example is motivated by existing anomalies in LHC data, care should be taken to streamline the referencing and ensure clarity and conciseness in the justification. Consideration should be given to replacing the second paragraph on page 3 with a succinct statement detailing the rationale behind choosing the 𝑍𝛾 analysis as an illustrative example of the proposed methodology. The heavy reliance on 16 self-citations out of 21 references, which exclude many relevant experimental papers, is in my opinion unnecessary in light of the actual topic of the present manuscript.
Furthermore, it is essential to accurately characterise the origin of these anomalies. Properly distinguishing between those confirmed by LHC collaborations and those proposed by phenomenological works, which may lack access to comprehensive statistical treatment, is crucial.
In addition, as written above, including other illustrative examples based on standard resonance searches in dijet, diphoton or dilepton final states, would be beneficial for readers.
Response 2: In our analysis we selected the 𝑍𝛾 final state, motivated by the multilepton anomalies, as an ideal showcase for an analysis using semi-supervised learning in a narrow resonance search with topological requirements. Although the 𝛾𝛾 final state is similarly motivated by the multilepton anomalies, we selected 𝑍𝛾. The methodology and results stand independent of the anomaly at 152 GeV and use it only as a showcase. Therefore, even if the 152 GeV excess goes away, the paper will still stand as a showcase. We have removed unnecessary self-citations and added relevant references from the LHC collaborations to better substantiate our motivation.
Comment 3: Section 2.1 lacks sufficient information on the simulation toolchain used. Event generation for the 𝑝𝑝→𝑍𝛾→ℓℓ𝛾 process seems to enforce the intermediate 𝑍 boson to be on-shell. However, since the mass window in 𝑚ℓℓ𝛾 is large enough, off-shell 𝑍 contributions, virtual-photon contributions, and their interference are relevant. It remains unclear whether they have been properly accounted for.
Furthermore, the discussion on the chosen parton density set is unclear. It is essential to clarify whether next-to-leading-order matrix elements have been consistently convolved with next-to-leading-order parton densities, and not leading-order ones.
Finally, the text does not clearly distinguish between generator-level cuts and reconstructed-level cuts that are implemented in the simulation. Providing a clear delineation between these sets of cuts is crucial. Additionally, Section 2.1 should include details on preselection criteria, like cuts on the number of leptons and photons, that are currently not discussed.
Response 3:
Our simulation accounts for the fact that the ℓℓγ invariant mass range of 130 to 170 GeV requires the Z boson to be off-shell. We configured MadGraph to include off-shell Z boson contributions, although a prompt photon is considered. The event generation utilized next-to-leading-order (NLO) parton distribution functions (PDFs) convolved with NLO matrix elements, ensuring consistency and accuracy in the simulation. Generator-level cuts were applied during the event generation in MadGraph, specifically imposing an invariant mass cut on the ℓℓγ system to select events within the 130-170 GeV range. These generated events were then processed through PYTHIA for parton showering and DELPHES for detector simulation, where reconstructed-level cuts were applied to mimic experimental conditions. Reconstructed-level cuts included detailed preselection criteria such as the number of leptons and photons, their transverse momenta, and isolation requirements. Furthermore, overlap removal procedures were implemented to avoid double-counting objects. This involved removing jets that were too close to leptons or photons to ensure distinct identification of each particle. These updates, along with more detailed descriptions, have been included in the revised version of the paper.
Comment 4: The manuscript should define central jets and specify the associated pseudorapidity cut.
Response 4: We have added the definition and associated pseudorapidity cut for central jets to Section 2.1 (Monte Carlo Simulation).
Comment 5: Figures 1 and 2 should be adjusted to improve readability. The missing transverse energy spectrum could be presented with a log scale or a reduced domain to enhance clarity. Additionally, in figure 2, all eight lower insets should indicate whether they refer to the sideband or signal mass window. In fact, consideration should be given to showing both these curves.
Response 5: We have updated Figures 1 and 2 to improve readability. Firstly, the domains of the missing transverse energy spectrum, the number of jets and the number of central jets have been reduced. Secondly, the overlap between the plots and the legends has been removed. Finally, in Figure 2 the lower insets have been updated to include the relative difference for both the sideband and mass window categories, with corresponding legends.
Comment 6: The caption of figure 4 should define the acronym 'BR' for clarity.
Response 6: We have added a definition for the acronym ‘BR’ in the caption of Figure 4 for improved clarity.
Comment 7: In Section 3.3, the manuscript should avoid using the term 'centre of mass' to refer to the 'center of the signal mass window', as 'centre of mass' has a different welldefined meaning.
Response 7: To avoid misinterpretation we have replaced our use of ‘center of mass’ with ‘center of the signal mass window’ as suggested.
Comment 8: It would be instructive to assess the impact of background systematics on the calculation of local significance, especially considering the incomplete background modeling acknowledged by the authors in Section 2. Equation (3) should be generalised accordingly, and the results of Section 4 updated subsequently.
Response 8: The study focuses on measuring the look-elsewhere effect. As background systematics are conventionally applied elsewhere in the analysis chain, they have been factored out of this study and can be applied afterwards. The background systematics must therefore be applied on top of this case study before it is applied to a specific excess. The background systematics are obtained by using different fitting functions, which however does not impact the DNN part of the analysis.
To make this clear to the reader we have added a footnote (“Note that the systematics related to the background fitting functions, i.e. the spurious signal analysis, are not included here. However, this uncertainty should be included on top of the additional look-elsewhere effect studied here and is not related to the use of NNs.”) following Equation 3.
Comment 9: The bibliography should be carefully proofread to correct any errors. Specifically, attention should be given to identifying and correcting duplicate references (like references [2] and [3]), updating references that are now published (like reference [23]), and ensuring insertion of complete references (like [45] and [46]).
Response 9: Firstly, we apologize for the number of errors in the bibliography. We have removed duplicate references, updated references to now-published articles, and corrected incomplete references.
Response to Referee report 2 Comments:
Weaknesses:
Comment (weakness) 1: While the methodology is a step in the right direction, it is far from clear that the proposed way to handle the statistically dependent Z-values or, equivalently, the statistically dependent p-values, is superior to methods that are already in use for the standard look-elsewhere effect.
Response (weakness) 1: The methodology is designed to evaluate the potential look-elsewhere effect arising from semi-supervised classifiers, rather than the standard look-elsewhere effect. This method employs the frequentist framework to calculate this effect directly (through the statistics of multiple tests), instead of relying on the approximations typically used for calculating the classic look-elsewhere effect.
Comment (weakness) 2: The authors point out correctly that the NN classifier needs to be statistically independent of the lepton-lepton-photon mass because the latter is used to define the two samples that are used for training, but they do not provide evidence that this is the case.
Response (weakness) 2: To make clear the dependence of the NN training samples on the lepton-lepton-photon mass, we have added the definitions of the sideband and mass window categories to Section 2.1 (previously only in Section 3.3).
Comment (weakness) 3: There is some inconsistency in the notation in Eqs. (1), (2) and (4). The symbols for invariant mass and Z seem to change. The invariant mass is denoted by a symbol with lepton and photon subscripts in the text but is denoted by m in Eq. (2). In Eq. (3) the symbol Z is used for signal significance. But in Section 4, which introduces the “probability density functions (PDF) of the significances”, the presumption is that these are the PDFs of the Z-values. If so, one might have expected to see symbols such as f_i(Z_i), i = 1,…,6 for each of the six significances Z_i, but instead one sees the symbol f(σ). Since Z (from Eq. (3), which is a well-known generalization of N_s/√N_b) can be interpreted as the “number of standard deviations above the background”, it is unclear what σ is intended to represent in f(σ). It could be σ = Z·√N_b, but that does not seem to be what is intended. This needs to be clarified. Furthermore, from Eq. (3) it is unclear why the domain of f(σ) should extend to negative values. That also needs clarification.
Response (weakness) 3: With regards to notation please see requested changes 1 and response.
In Equation 3, the domain is extended to negative values to allow a full analysis of the dynamics of each sample. If a negative signal yield is found, then the corresponding significance must reflect that. This is used to verify the fitting process and therefore confirm the validity of the positive-only values of interest.
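The signed-significance convention described above can be sketched as follows. This is an illustrative stand-in using the standard Asimov counting formula, not necessarily the paper's exact Equation 3; `signed_significance` and its arguments are hypothetical names.

```python
import math

def signed_significance(n_obs: float, n_bkg: float) -> float:
    """Asimov-style counting significance carrying the sign of the fitted
    signal yield: a deficit (n_obs < n_bkg) maps to a negative Z, so the
    fit dynamics can be inspected on both sides of zero."""
    if n_obs <= 0.0 or n_bkg <= 0.0:
        return 0.0
    n_sig = n_obs - n_bkg
    # Z^2 = 2 * (n_obs * ln(n_obs / n_bkg) - n_sig); clamp against
    # tiny negative values from floating-point rounding.
    z_sq = 2.0 * (n_obs * math.log(n_obs / n_bkg) - n_sig)
    return math.copysign(math.sqrt(max(z_sq, 0.0)), n_sig)

# An upward fluctuation gives Z > 0, a deficit gives Z < 0:
# signed_significance(120, 100) ≈ +1.94, signed_significance(80, 100) ≈ -2.07
```

Retaining the sign lets the full response distribution be compared against the expected symmetric background-only behaviour, which is exactly the validation role described above.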
Comment (weakness) 4: It is unclear why the fits are not performed so that they give Ns and Nb directly. For example, if the standard particle physics toolkit RooFit were used that would be the case since the probability densities are automatically normalized.
Response (weakness) 4: The fit methodology was initially implemented using the RooFit toolkit, together with the asymptotic calculator, to calculate the significance as suggested. This was repeated using the more “manual” process described in the paper (after verifying that the results reflected those of the internal toolkit methods) to provide a more comprehensive and clearer methodology.
Comment (weakness) 5: In Section 4, it is unclear why the average PDF (Eq. (6)) is the appropriate quantity to use to compute the global p-value. (See discussion in Report.)
Response (weakness) 5: Please see requested changes 2 and response.
Requested Changes:
Comment (requested change) 1: Remove inconsistencies in the notation
Response (changes) 1:
To improve consistency of notation we have updated Equation 2 to use the correct mass notation (mℓℓγ). We have updated Equations 4, 5, 6 and 7, as well as figure labels and in-text references, to use “Z” for significance rather than the misleading σ (e.g. changed f(σ)_BR to f(Z)_BR).
Comment (requested change) 2: Motivate use of Eq. 6 (... it is unclear why the average PDF, Eq. (6), which is billed as the “global” PDF, is the appropriate quantity to use to compute the global p-value. Is it not necessary to account for the statistical dependencies between the Z-values if all six are used?)
Response (changes) 2: We acknowledge that Eq. 6 was misleading. We have therefore removed Eq. 6 and replaced it with explanatory text. The global distribution of results is directly calculated as the ensemble of all the outcomes across all background rejections, and not as an average or weighted average. Although different numbers of events enter the fits at each BR, the number of fits/significance values for each BR is equal to the number of pseudo-experiments. Thus the global values include outputs from all BRs, allowing the dynamics of the semi-supervised response to be understood across all BRs without focusing on potential bias in individual categories (which is exposed in the comparative plots of local/BR distributions).
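The pooling described above (concatenating the per-BR significance samples rather than averaging their PDFs) can be sketched as follows. The BR working points and Gaussian toy significances are hypothetical stand-ins for the fitted values of the paper:

```python
import random

random.seed(0)

# Toy stand-in: significances from N pseudo-experiments at each of six
# background-rejection (BR) working points; in the paper these come from
# signal-plus-background fits to each pseudo-experiment's mass spectrum.
n_pseudo = 10_000
br_points = (0, 50, 70, 80, 90, 95)  # illustrative BR percentages only
z_per_br = {br: [random.gauss(0.0, 1.0) for _ in range(n_pseudo)]
            for br in br_points}

# Global distribution = the pooled ensemble of all outcomes across BRs
# (a concatenation of samples, NOT an average of the per-BR PDFs).
# Each BR contributes the same number of entries: n_pseudo.
z_global = [z for zs in z_per_br.values() for z in zs]

def global_p_value(z_obs: float) -> float:
    """Right-tail p-value of z_obs under the pooled ensemble."""
    return sum(z >= z_obs for z in z_global) / len(z_global)
```

Because every BR contributes exactly `n_pseudo` entries, no explicit weighting is needed: the concatenation already gives each category equal weight in the global distribution.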
Comment (requested change) 3: Given an analysis that yields K statistically dependent p-values (or Z-values), explain why methods such as the Bonferroni correction are not sufficient to account for this particular look-elsewhere effect, or why methods already in use in particle physics cannot be applied to this look-elsewhere effect.
Response (changes) 3: Methods such as the Bonferroni correction and the Gross-Vitells procedure are approximations used to estimate the look-elsewhere effect (LEE). The frequentist framework used here calculates the LEE directly, providing a true depiction of the extent of the induced error. Additionally, when calculating the look-elsewhere effect generated by a semi-supervised NN, one cannot assume that the standard estimations are sufficient; it must be calculated directly.
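A toy illustration of the point above: for strongly correlated tests, the Bonferroni bound overcorrects, while a direct frequentist calculation from pseudo-experiments recovers the true tail probability. The correlated-Gaussian model, the correlation value `rho`, and the helper names are all hypothetical, not taken from the paper:

```python
import math
import random

random.seed(1)

def p_of_z(z: float) -> float:
    """One-sided Gaussian tail p-value for a significance z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Toy model: K correlated Z-values per pseudo-experiment; the shared
# component rho mimics the overlap between background-rejection categories.
K, rho, n_pseudo = 6, 0.8, 20_000
z_max = []
for _ in range(n_pseudo):
    shared = random.gauss(0.0, 1.0)
    zs = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * random.gauss(0.0, 1.0)
          for _ in range(K)]
    z_max.append(max(zs))

def direct_global_p(z_obs: float) -> float:
    """Global p-value computed directly from the Z_max ensemble."""
    return sum(z >= z_obs for z in z_max) / len(z_max)

z_obs = 2.0
p_bonferroni = min(1.0, K * p_of_z(z_obs))  # upper bound; ignores correlations
p_direct = direct_global_p(z_obs)           # frequentist, exact up to MC error
# With rho = 0.8 the tests are strongly correlated, so p_direct is well
# below the Bonferroni bound: the bound overstates the trials factor.
```

This is why a direct pseudo-experiment calculation is preferred here: it captures the actual dependence structure induced by the classifier instead of bounding it.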
Once again, we extend our gratitude to the referee for their detailed and insightful comments. We trust that the comprehensive responses and revisions provided adequately address all the points raised. We are optimistic that our revised manuscript now meets the criteria for publication in SciPost.
Reports on this Submission
Report
This paper is about the trials factor in a variant of weakly supervised anomaly detection at the LHC, which as far as I can tell is just the usual CWoLa method but with multiple thresholds on the anomaly score. The authors are interested in quantifying a trials factor associated with performing multiple simultaneous bump hunts with these different thresholds.
I have one major concern about the methodology of this paper. To the best of my understanding, the trials factor being estimated here is not the usual trials factor considered in HEP. The usual trials factor would involve comparing the nominal value Zmax = max(Z1, ..., Z6), i.e. the maximum local significance across the different thresholds (six in this paper), and the true significance obtained by constructing the distribution of Zmax using pseudo-experiments.
Instead, the authors are considering a different kind of trials factor. They are comparing
(a) the p-value p(Z) corresponding to a Z score, calculated from the CDF of the distribution obtained from flattening all 6 Z scores across all pseudo-experiments into a single distribution; to
(b) the p(Z) derived from the CDF of the Z-score distribution for the inclusive bump hunt.
They are referring to (a) as the global significance and (b) as the local significance. I think I'm okay with calling (a) a global significance, but I don't understand why it should be compared only against the inclusive Z-score. Shouldn't it also be compared against the Z-scores from all the other CWoLa thresholds as well?
The authors should also acknowledge/clarify that the trials factor they are considering here differs from the other type of trials factor often considered in HEP.
Some more specific points of feedback are:
1) References 19-26 are woefully incomplete. The authors should do a more thorough and proper job surveying the literature and add many more references on the applications of weak supervision to anomaly searches at the LHC. A lot of work has been done by many authors on this topic and the authors are not giving the proper credit where credit is due.
2) "therefore the training samples should be indistinguishable to the NN apart from statistical fluctuations."
What if the SB and SR events have systematic differences (i.e. the features are not uncorrelated with m)? The authors do not demonstrate that the CWoLa method is even valid here.
3) I don't understand Fig 5b: why is the p-value more significant than expected from Z_T for lower background rejection fractions (e.g. for 0% selection, the p-value is 3 sigma when Z_T is 2 sigma)?
4) If the NN sculpts the m distribution this will also inflate p-values. How can the authors be sure that it is a LEE and not sculpting?
Recommendation
Ask for major revision
Strengths
1. The authors have addressed my comments and suggestions.
Weaknesses
1. I still have quibbles, but there are no showstoppers.
Report
The paper passes the bar.
Requested changes
None.
Recommendation
Publish (meets expectations and criteria for this Journal)
Report
I would like to thank the authors for following my suggestions and addressing my comments. However, there are still two items (my previous comments #2 and #3) that have not been satisfactorily considered.
1 - Former comment #2. I still find it exaggerated that nearly half of the bibliography of this paper (35 references out of 73) is necessary to justify the choice of the $Z\gamma$ example, which is just a showcase for the methodology. Moreover, among these 35 references, there are 15 self-citations (specifically, the entire set of chosen theory papers discussing the multilepton anomalies).
Additionally, the text is written in a way that might lead readers to believe that a model in which the Standard Model is extended by three scalars with welldefined masses has been confirmed by data. This is not the case, as such a confirmation could only be made by the LHC experiments (and not by any theoretical study). This should be made crystal clear, and the text should be rephrased accordingly.
Finally, the authors mentioned that they chose the multilepton anomalies as a showcase for the proposed method. While this is fine, I do not understand their reluctance to add a second example. As they stated in their reply to my report, the methodology is independent of the chosen physics case. Therefore, there should be no reason to refuse to add a second, different and maybe more traditional, physics example.
2 - Former comment #3. The description of the simulation tool chain used in this work is now very clear, and I thank the authors for adding the associated text in Section II.1. However, I do not understand why virtual-photon contributions have been ignored in the simulation of the signal. The process considered is $pp \to \ell\ell\gamma$ with the lepton pair invariant mass being off the $Z$ peak. Therefore, there is no reason to only consider $Z$-mediated diagrams, as the impact of the virtual-photon component on the signal is not negligible. This should be trivial to fix with MG5aMC.
Furthermore, the authors should explicitly state which PDF set they have used. Is it NNPDF 2.3 as suggested by the reference quoted?
Recommendation
Ask for minor revision