
SciPost Submission Page

Symmetries, Safety, and Self-Supervision

by Barry M. Dillon, Gregor Kasieczka, Hans Olischläger, Tilman Plehn, Peter Sorrenson, and Lorenz Vogel

This is not the latest submitted version.


Submission summary

Authors (as registered SciPost users): Barry Dillon · Tilman Plehn · Lorenz Vogel
Submission information
Preprint Link: scipost_202108_00046v1  (pdf)
Code repository: https://github.com/bmdillon/JetCLR
Date submitted: 2021-08-18 11:17
Submitted by: Dillon, Barry
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approaches: Theoretical, Phenomenological

Abstract

Collider searches face the challenge of defining a representation of high-dimensional data such that physical symmetries are manifest, the discriminating features are retained, and the choice of representation is new-physics agnostic. We introduce JetCLR to solve the mapping from low-level data to optimized observables through self-supervised contrastive learning. As an example, we construct a data representation for top and QCD jets using a permutation-invariant transformer-encoder network and visualize its symmetry properties. We compare the JetCLR representation with alternative representations using linear classifier tests and find it to work quite well.
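
For orientation, a minimal PyTorch sketch of the contrastive-learning idea summarized in the abstract. The encoder, dimensions, and noise augmentation are placeholders for illustration, not the JetCLR transformer-encoder or its symmetry-based augmentations:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Minimal contrastive loop: two augmented views of each jet are pulled
    # together in representation space, different jets are pushed apart.
    encoder = nn.Sequential(nn.Linear(120, 256), nn.ReLU(), nn.Linear(256, 128))
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    def augment(batch):
        # Placeholder augmentation; JetCLR uses symmetry-based ones instead.
        return batch + 0.01 * torch.randn_like(batch)

    def nt_xent(z1, z2, tau=0.5):
        # Standard SimCLR-style NT-Xent loss with cosine similarity.
        n = len(z1)
        z = F.normalize(torch.cat([z1, z2]), dim=1)   # unit vectors
        sim = (z @ z.T) / tau
        sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
        return F.cross_entropy(sim, targets)

    batch = torch.randn(64, 120)                      # flattened toy jets
    loss = nt_xent(encoder(augment(batch)), encoder(augment(batch)))
    opt.zero_grad()
    loss.backward()
    opt.step()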

Current status:
Has been resubmitted

Reports on this Submission

Anonymous Report 4 on 2021-12-05 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:scipost_202108_00046v1, delivered 2021-12-05, doi: 10.21468/SciPost.Report.3999

Report

This is a very interesting study on implementing already-known physical symmetries in the network using contrastive learning. The study is well documented, with the exceptions already pointed out by the other referees. Beyond those comments, I would like to raise two issues.

1) The authors point to a similarity observable, used in the processing of the network, in equation 6. Can they elaborate on this further: could there be other approaches, such as the use of different similarity constructions? If so, what kind of effect would they have on the outcome of the network? (See the sketch after point 2.)
2) In figure 4 the authors show the evolution of the test data with respect to each epoch. However, this has not been compared to the training data, so it is not possible to tell whether the model is well trained.
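
For reference, the similarity entering SimCLR-style contrastive learning, which JetCLR adapts, is typically the cosine similarity inside an NT-Xent loss; as a generic sketch of the standard construction (not a verbatim copy of the manuscript's equation 6):

$$ s(z_i, z_j) = \frac{z_i \cdot z_j}{|z_i|\,|z_j|}\,, \qquad \mathcal{L}_i = -\log \frac{\exp\big(s(z_i, z_{i'})/\tau\big)}{\sum_{j \neq i} \exp\big(s(z_i, z_j)/\tau\big)}\,, $$

where $z_{i'}$ is the representation of the augmented partner of jet $i$ and $\tau$ is a temperature hyperparameter. Alternative similarity constructions in the sense of point 1 would include a plain dot product, a negative Euclidean distance, or a learned bilinear form $z_i^{\top} W z_j$.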

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Barry Dillon  on 2022-01-18  [id 2105]

(in reply to Report 4 on 2021-12-05)

We'd like to thank the referee for the comments and suggestions on the manuscript. We have implemented them in the resubmitted draft. The details of the changes are in the author comments on the resubmission page.

Anonymous Report 3 on 2021-11-16 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:scipost_202108_00046v1, delivered 2021-11-16, doi: 10.21468/SciPost.Report.3851

Report

This is interesting work in which the authors aim to exploit intrinsic symmetries of high-dimensional data to discriminate between top-quark and QCD jets. They introduce a method called JetCLR (jet observables for Contrastive Learning of Representations) and benchmark it using the linear classifier test. The paper is well written in a clear and didactic style. However, some valid suggestions to further improve the presentation have been made by the other referees. Therefore, I suggest publishing this paper after addressing the points raised in the previous referee reports.

  • validity: -
  • significance: -
  • originality: -
  • clarity: -
  • formatting: -
  • grammar: -

Author:  Barry Dillon  on 2022-01-18  [id 2104]

(in reply to Report 3 on 2021-11-16)

Thanks for the comments; we believe we have implemented the suggestions made by the other referees. The details are in the author comments on the resubmission page.

Report 2 by Andy Buckley on 2021-11-16 (Invited Report)

  • Cite as: Andy Buckley, Report on arXiv:scipost_202108_00046v1, delivered 2021-11-16, doi: 10.21468/SciPost.Report.3850

Strengths

1. A useful and helpfully practical guide to an approach for forcing approximate physical symmetries into neural-net classifiers

2. Convincing demonstration that this method produces a performant top/QCD jet classifier, comparable to or exceeding the discriminating power of other methods (either fully non-parametric, or using an explicitly physical basis)

Weaknesses

1. Unclear presentation, or assumed knowledge, of various ML concepts, particularly performance metrics. Some assumed knowledge is reasonable these days, but this paper takes it a little too far and erects unnecessary barriers to understanding by non-experts. A little text reworking would improve this greatly.

Report

A nice and compact paper, exploring constructively the important business of encoding a priori physical symmetries into neural-network applications in physics. The practical demonstration of methods to force such encoding rather than simply hope for the network to learn known features is as important an output as the JetCLR code itself. I have a few comments, primarily on the presentation, which is sometimes disjointed or opaque -- particularly for newcomers to the area -- and making it more approachable would take little work and be only a good thing.

• Sec 2, p3: Not key to the thrust of the paper, but I assume that (in addition to generation simplicity) the reason to disable MPI as well as ignoring pile-up is that in practice jets would be groomed in some way that effectively removes both contributions? No grooming step is mentioned, but maybe the Delphes particle-flow algorithm is meant to approximate pile-up/UE suppression. Probably the end result is not strongly affected, particularly as only the leading constituents are passed to the network (perhaps better would have been to pre-cluster to a fixed max number of constituents?), but it'd be good to justify the input choices as being reasonably close to the realistic application.

• p3-4: in this presentation it's hard to know what is signposting and what is the actual full presentation of the idea. I think some more advance notice is needed, to paint the picture that what is coming is use of known pairings in the training set, and a matching loss function, to empirically nudge rotational, translational, IRC, and permutation symmetries into the resulting network weights and architecture. It'll then be more obvious what is going on from the start of each subsection, which is currently opaque until a second reading in several cases.
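
The augmentation scheme summarized in the previous comment can be made concrete with a short sketch. This is a minimal NumPy illustration, assuming each jet is an (N, 3) array of (pT, eta, phi) constituents; the function names and the smearing model are illustrative and not taken from the JetCLR code, whose IRC augmentations also include, e.g., collinear splittings:

    import numpy as np

    rng = np.random.default_rng(0)

    def rotate(jet):
        # Rotation around the jet axis in the eta-phi plane.
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        pt, eta, phi = jet[:, 0], jet[:, 1], jet[:, 2]
        return np.stack([pt, c * eta - s * phi, s * eta + c * phi], axis=1)

    def translate(jet, max_shift=1.0):
        # Common shift of all constituents in eta and phi.
        d_eta, d_phi = rng.uniform(-max_shift, max_shift, size=2)
        return jet + np.array([0.0, d_eta, d_phi])

    def soft_smear(jet, scale=0.1):
        # IRC-motivated: smear soft (low-pT) constituents more than hard ones.
        out = jet.copy()
        out[:, 1:] += rng.normal(0.0, scale, size=(len(jet), 2)) / np.sqrt(out[:, :1] + 1e-8)
        return out

    jet = np.abs(rng.normal(size=(30, 3)))   # toy jet with 30 constituents
    augmented = soft_smear(translate(rotate(jet)))

Each original jet and its augmented partner form the positive pair fed to the contrastive loss, which is what nudges the learned representation towards these symmetries.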

• Sec 3 intro: Maybe note prior art on permutation invariance in the work of Thaler et al on deep sets and energy-flow networks (and EFPs), e.g. https://arxiv.org/abs/1810.05165. This would also clarify what exactly is meant by "permutation symmetry" at the top of p6

• The section on attention is quite opaque to anyone not already familiar with the idea in some depth. The naming of the described operator as "attention" isn't motivated, the distinctions between queries/keys and how their originating W matrices are learned aren't given, and the need for any of this isn't clear other than that the sum over elements makes the network permutation-independent. I feel like this section either needs to be larger and to explicitly motivate this mechanism for learning jet structure beyond being one of several permutation-invariant architectures, or to make it much shorter and simply declare that a self-attention based transformer network provides the desired permutation-independence (intrinsically rather than as a strategy in the training).
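
To make the permutation point concrete: self-attention is permutation-equivariant (shuffling the constituents shuffles the outputs identically), so a sum over constituents yields a permutation-invariant representation. A minimal, self-contained PyTorch sketch, not the JetCLR architecture (dimensions and the all-false padding mask are placeholders); the key_padding_mask also illustrates the zero-pad handling mentioned in the next comment:

    import torch
    import torch.nn as nn

    embed_dim, n_heads, n_const = 64, 4, 30
    attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    jet = torch.randn(1, n_const, embed_dim)          # embedded constituents
    pad = torch.zeros(1, n_const, dtype=torch.bool)   # True = zero-padded slot

    out, _ = attn(jet, jet, jet, key_padding_mask=pad)
    rep = out.sum(dim=1)                              # invariant pooling

    perm = torch.randperm(n_const)
    out_p, _ = attn(jet[:, perm], jet[:, perm], jet[:, perm],
                    key_padding_mask=pad[:, perm])
    print(torch.allclose(out_p.sum(dim=1), rep, atol=1e-5))   # -> True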

• It's nice to have some technical detail on dealing with issues like the variable-length inputs and tweaks to ensure ignoring of the zero-pad elements. Very useful and just the right level, thanks.

• Sec 4: "LCT" is used here for the first time since the single occurrence in the introduction: I think it would be better to re-introduce it, and to define the acronym only at this point, where it's used. I'm not entirely unaware of ML terminology, but had to re-search the document to find what was being referred to. Having located it and re-read, I still do not know how the linear classifier cut is related to the trained network -- this basic understanding shouldn't require reading the appendix. The relevance of an LCT, i.e. how the performance is being assessed, should also be made clear here: I don't think enough context was given in the intro, and that was a long time ago, intellectually. The first paragraph of Sec 4 is a very natural place to explain how the testing is to be done. (Minor note, it'd be helpful to give a hint of what the "very different applications" of Refs [37,55] actually are, rather than forcing the reader to jump to the bibliography and cross-check!)
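
For readers meeting the term here: the linear classifier test, in its generic form, freezes the learned representations and fits only a linear model on top, so the score measures how linearly separable signal and background already are in that space. A minimal scikit-learn sketch, with random placeholders standing in for frozen JetCLR outputs and truth labels (whether the paper uses logistic regression or another linear model is not restated here):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    reps_train, y_train = rng.normal(size=(5000, 128)), rng.integers(0, 2, 5000)
    reps_test, y_test = rng.normal(size=(1000, 128)), rng.integers(0, 2, 1000)

    # Only the linear model is trained; the representations stay fixed.
    lct = LogisticRegression(max_iter=1000).fit(reps_train, y_train)
    print("LCT AUC:", roc_auc_score(y_test, lct.decision_function(reps_test)))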

• Tab 2: I know it's standard in ML literature, but AUC (and the epsilons) should be defined, and some motivation given for why it's a good metric (it is not clear to me why the integral performance of the model, rather than the performance of the best working point within it, is the best metric).
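
The metrics in question follow the usual jet-tagging conventions: eps_S is the signal efficiency (true-positive rate), eps_B the background mistag rate (false-positive rate), AUC the area under the ROC curve, and 1/eps_B at a fixed working point such as eps_S = 0.5 the background rejection. A short sketch of how they are computed, with toy scores standing in for a classifier output:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, 10000)
    scores = rng.normal(loc=labels.astype(float))   # toy classifier scores

    eps_B, eps_S, _ = roc_curve(labels, scores)
    print("AUC:", auc(eps_B, eps_S))
    print("1/eps_B at eps_S = 0.5:", 1.0 / np.interp(0.5, eps_S, eps_B))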

• Nice result on the reducing S/B ratio! The overall ROC curves also look nice -- it would be good, if possible, to comment on why JetCLR appears to significantly outperform the EFPs of the same latent dimension. Implicit linearity in the discriminator built on the EFPs? (I note that little detail is given on the EFP classifier, and it would be good to say more. Also, are the EFPs calculated using the code associated with the original paper?)

Requested changes

1. Better clarity in the relevance of the detailed description of the self-attention structure: either more or significantly less text, with clear connection to the application.

2. A more complete description of the role of the LCT in evaluating network performance, without requiring reference to the appendix.

3. Explicitly define/explain classifier performance metrics and terminology.

  • validity: top
  • significance: high
  • originality: good
  • clarity: good
  • formatting: perfect
  • grammar: excellent

Author:  Barry Dillon  on 2022-01-18  [id 2103]

(in reply to Report 2 by Andy Buckley on 2021-11-16)

We’d like to thank the referee for the constructive and detailed suggestions. We believe we have implemented all of the requested changes/additions in the resubmitted version of the manuscript. Answers to the questions were given in the author comments in the resubmission. If anything is unclear, please let us know.

Anonymous Report 1 on 2021-10-28 (Invited Report)

  • Cite as: Anonymous, Report on arXiv:scipost_202108_00046v1, delivered 2021-10-28, doi: 10.21468/SciPost.Report.3753

Strengths

1- novel representation for QCD jets based on recently presented Contrastive Learning

2- technology transfer from self-supervised learning of image representations to HEP

3- potential for future application in data analyses and interpretation

Weaknesses

1- hard for a non-expert in ML to grasp the more detailed insights; similarly, for a non-QCD expert I assume the specifics of jet physics are rather opaque

2- rather cluttered nomenclature in Sec. 3

Report

The paper presents the adaptation and application of the recently presented Contrastive Learning technique to derive suitable representations for QCD jets.
This method was originally developed for visual representations of actual images. The authors adapt it to the situation of QCD jets, with the particular application example of discriminating plain QCD jets from top-quark-initiated jets in simulated LHC data. However, the scope of the method is wider, including unsupervised anomaly detection in jet data from the LHC experiments.

The authors put particular emphasis on incorporating symmetries relevant for the representation of QCD jets by constructing appropriate augmentations of the jet data used to train the model.

The idea is quite innovative and offers potential for interesting future applications. This certainly qualifies the paper for publication in SciPost. However, prior to publication I would request the authors to address the following list of changes/clarifications.

Requested changes

1- At least a few sentences introducing QCD jets, the central object of the paper, should be added, either in the general intro or in Sec. 2, including what the considered jet constituents represent, i.e. individual particles, detector cells, etc.

2- Similarly, a brief comment on the separation of top-quark vs. light-quark/gluon initiated jets seems appropriate.

3- In Sec. 2, when discussing the imposed/assumed physical symmetries, the statements made should be underpinned by suitable references, in particular regarding IRC safety.

4- Among the proposed augmentations are translations of the jets in $\eta-\phi$ space; what is the reasoning behind this? We know that in particular the $\eta$ distribution of jets is not flat and is in fact linked to the jet initiator.

Further, I don't actually understand the relevance of the translations: just before, the authors describe aligning each jet with the origin of the $\eta-\phi$ plane, i.e. translating it away from its actual position in $\eta-\phi$ anyhow. Am I just missing the point?

5- In the intro to Sec. 3 the authors highlight the permutation symmetry of QCD jets, i.e. the invariance under reordering the constituent labels; at least to me, however, the meaning was not clear on a first reading. Maybe a comment would help other readers.

6- The nomenclature used in the paragraph on Attention is quite cluttered and complicated to grasp, i.e. it is unclear when we are considering a single jet, all of its constituents, or just a single constituent. This could certainly be improved, e.g. by simply using a second index on the $x_i$. Further, what does $q_i$ stand for beyond $q_1$?

7- On page 9 the authors make a comment about convergence issues in anomaly searches. For me as a non-expert a reference seems to be appropriate.

8- Lastly, a more general question: as set up, the jet representation seems agnostic about the actual production process of the jets, e.g. through explicitly imposing rotation symmetry. For certain production modes, however, there might be preferred directions within jets; consider for example jet pull. Would the authors assume that this requires adjustments to the augmentations, and correspondingly trainings for specific analyses, or to what extent is the proposed setup considered universal?

  • validity: high
  • significance: high
  • originality: high
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Author:  Barry Dillon  on 2022-01-18  [id 2102]

(in reply to Report 1 on 2021-10-28)

We’d like to thank the referee for the constructive and interesting suggestions; we believe we have implemented everything requested by the referee in the resubmitted version.

As for the question in the 8th point in the report, this is very interesting. So far we have only looked in detail at individual jets, so the augmentations are driven towards enforcing symmetries that are completely agnostic to the event-level dynamics. If we were to look at whole events, we would certainly need to reconsider the types of augmentations and symmetries we include. The most obvious of these are the translations in eta. So, the proposed set-up is only universal in the context of studying individual jets. More work needs to be done when applying the technique to multi-jet or whole events.


