SciPost Submission Page
MACK: Mismodeling Addressed with Contrastive Knowledge
by Liam Rankin Sheldon, Dylan Sheldon Rankin, Philip Harris
Submission summary
Authors (as registered SciPost users): Dylan Rankin

| Submission information | |
|---|---|
| Preprint Link: | scipost_202503_00008v1 (pdf) |
| Date submitted: | 2025-03-05 09:24 |
| Submitted by: | Rankin, Dylan |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
The use of machine learning methods in high energy physics typically relies on large volumes of precise simulation for training. As machine learning models become more complex, they can become increasingly sensitive to differences between this simulation and the real data collected by experiments. We present a generic methodology based on contrastive learning which is able to greatly mitigate this negative effect. Crucially, the method does not require prior knowledge of the specifics of the mismodeling. While we demonstrate the efficacy of this technique using the task of jet-tagging at the Large Hadron Collider, it is applicable to a wide array of different tasks both in and out of the field of high energy physics.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Author comments upon resubmission
List of changes
Review 1 comments:
1. As mentioned in the weaknesses, the authors should clarify the motivation for the specific contrastive loss used. Since it is a major portion of the work, the authors should at least write down the loss in the text and explain each term instead of simply giving a citation.
- Done.
2. How are the alternate samples constructed? Are they only background, as Fig. 1 suggests, or a mix of both? I would expect that you want the alternate sample to follow the training samples as closely as possible, i.e. the same expected composition and fraction of signal/background events, in order to have a sensible EMD pairing. Is this correct? If so, is this an issue with the current implementation?
- The alternate samples are constructed only with background, as this is how it would need to be done in the case where real data (without signal) is used. In some applications it might be possible to include an impure signal region from data (e.g. W-tagging), and we expect, as you suggest, that this would improve the results, but we do not consider it in our results and leave its exploration to future work. We have added a comment on this to Section 3.1.
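For illustration, a minimal sketch of how such a background-only EMD pairing could be implemented. The function name, jet format, and one-to-one assignment below are illustrative assumptions, not the exact code used in the paper:

```python
# Sketch: pair each nominal background jet with an alternate background jet
# by minimizing the energy mover's distance (EMD). Assumes jets are numpy
# arrays of particles with columns (pT, rapidity, phi). Illustrative only.
import numpy as np
import energyflow as ef
from scipy.optimize import linear_sum_assignment

def emd_pairs(nominal_jets, alternate_jets, R=0.8):
    """Return index pairs (i, j) matching nominal jet i to alternate jet j."""
    cost = np.zeros((len(nominal_jets), len(alternate_jets)))
    for i, jet_a in enumerate(nominal_jets):
        for j, jet_b in enumerate(alternate_jets):
            cost[i, j] = ef.emd.emd(jet_a, jet_b, R=R)
    # One-to-one assignment minimizing the total EMD (one possible choice).
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```

The resulting pairs would then serve as the positive pairs in the contrastive stage.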
3. Fig. 2: The ROC curves are very hard to read, and impossible if you are colorblind. The figure label helps, but the legend needs improvement. Use two boxes: one that simply states that testing on nominal is a solid line and testing on alternative samples is a dashed line. In the other box, make clear, possibly with other colors, what each entry means: one line for MACK without augmentations, one for MACK with augmentations, one for training on nominal augmentations, and one for training on alternative augmentations.
- Done.
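For reference, a minimal matplotlib sketch of the two-legend-box layout suggested above; the handles, colors, and positions are illustrative, not the exact styling used in the updated figures:

```python
# Two legend boxes: one encodes line style (test sample), the other encodes
# color (training configuration). Purely illustrative.
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

fig, ax = plt.subplots()
# ... plot ROC curves here, e.g. ax.plot(tpr, 1/fpr, color=..., linestyle=...) ...

style_handles = [
    Line2D([], [], color="black", linestyle="-", label="tested on nominal"),
    Line2D([], [], color="black", linestyle="--", label="tested on alternative"),
]
color_handles = [
    Line2D([], [], color="C0", label="MACK, no augmentations"),
    Line2D([], [], color="C1", label="MACK, with augmentations"),
    Line2D([], [], color="C2", label="trained on nominal aug."),
    Line2D([], [], color="C3", label="trained on alternative aug."),
]
style_legend = ax.legend(handles=style_handles, loc="lower left")
ax.add_artist(style_legend)  # keep the first legend when adding the second
ax.legend(handles=color_handles, loc="upper right")
```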
4. In the same figure, I don't understand why training on nominal and evaluating on nominal has such low performance compared to MACK. Worse, the performance on both nominal and alternative samples seems to be equally bad. Where does the additional performance come from? Maybe I'm reading the plot wrong, which reinforces my point that the labels aren't clear.
- This poor performance comes from the fact that the training procedure still requires the MACK method in all cases. Thus, if the contrastive training only brings nominal and alternative representations together but does not result in features that are effective at separating signal and background, then the resulting classifier will be quite poor. We have added a comment on this to the text to address confusion.
5. Table 1: it would be great to add errors to this table to show the stability of MACK across different runs.
- Done.
6. The extreme example is interesting, but also highly unrealistic. However, I would be curious to see what happens if you keep the training strategy on JetNet as you have, with the alternate samples coming from a mixture of other physics processes, but evaluate the classifier on the same physics processes used for training, with some modification: either a different tune, as you already did in the previous exercise, or a different generator. The main reason I ask is that I would like to know the impact of the choice of "data" used when calculating the pairs and aligning the embedding. If my data contain other processes, or if the fraction of signal and background is different, does MACK hurt or still help?
- This is a great question and one which we hope to address in future work; however, given that it requires a broader set of samples than we currently have available, we elect to leave it to a future study.
Review 2 comments:
1. The method description needs more details about the training procedure. It would greatly benefit from introducing all used loss terms instead of only citing other papers.
- Done.
2. On page 3, in the last paragraph:
"...a featurizer and a classifier network. The featurizer and projector network each separate share weights between the legs of the siamese network."
In the first sentence you call it a classifier, and then a projector. Are these the same? This needs to be clarified.
- Thank you for catching this, they are indeed the same. We have made the language consistent throughout.
3. While you cite a paper in Section 4 showing that the usage of an additional projector after the featurizer is well-motivated, I would suggest adding this line of argumentation to Section 3, where these networks are introduced, and maybe adding 2-3 lines of explanation of why this is a good idea. Before reading this sentence, I asked myself in Section 3, "OK, but why do you need this additional projector exactly?" So far, it is not clear.
- We have added an explanation of this behavior to Sec. 3.1.
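As a rough illustration of the structure, a minimal PyTorch-style sketch is given below. The layer sizes and module names are placeholders, not those of the paper; the point is only that the contrastive loss acts on the projector output while the downstream classifier is later attached to the featurizer output, which is the usual motivation in the contrastive-learning literature for keeping a separate projector:

```python
# Minimal sketch of the featurizer + projector structure (PyTorch).
# Dimensions and layer counts are placeholders, not the paper's values.
import torch
import torch.nn as nn

class Featurizer(nn.Module):
    def __init__(self, in_dim=64, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
    def forward(self, x):
        return self.net(x)

class Projector(nn.Module):
    def __init__(self, feat_dim=128, proj_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, proj_dim),
        )
    def forward(self, h):
        return self.net(h)

featurizer, projector = Featurizer(), Projector()

# Both legs of the siamese network share these weights: the same modules are
# applied to a nominal jet x_nom and its paired alternate jet x_alt.
def embed(x):
    return projector(featurizer(x))

# The contrastive loss is computed on embed(x_nom) and embed(x_alt); the
# downstream classifier is later attached to featurizer(x) only.
```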
4. Also, given that you have multiple networks and subnetworks with different training phases, it is not immediately apparent which (sub)network is trained or kept fixed. This could be summarized with a table or plot.
- We have added additional description of the networks we train to Sec. 3.1 and Sec. 5.1 to help clarify the fine-tuning procedure.
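A sketch of the kind of downstream phase described there is shown below. Freezing the featurizer and training only a small classifier head is one common choice, shown purely for illustration; the exact fine-tuning configuration is the one described in Sec. 5.1:

```python
# Sketch of the downstream phase: the contrastively trained featurizer is kept
# fixed and only a small classifier head is trained (one common choice).
# `featurizer` is the pretrained Featurizer instance from the sketch above.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

for p in featurizer.parameters():   # freeze the pretrained featurizer
    p.requires_grad = False

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def training_step(x, y):
    with torch.no_grad():
        h = featurizer(x)           # frozen features; the projector is not used here
    loss = bce(classifier(h).squeeze(-1), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```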
5. If I understand correctly, your supervised network trained on the feature space L differs from the "Supervised Model Design" you mention in Section 4.2, doesn't it? Right now, it is not clear. Parts of the models are introduced in Section 3, and other parts only appear in Section 4. What about streamlining this into a single section, where all networks and classes are properly introduced, and then the introduction of the dataset becomes part of Section 5, which is about the actual application? This would make the storyline much more straightforward. Moreover, this structure would emphasize again that the MACK model itself is independent of the actual downstream task.
- We have adjusted the structure of these two sections significantly (although the content remains largely the same). Section 3 now introduces the method and the network architectures, while Section 4 presents the datasets used.
6. The label and legend sizes of your Figures 2, 3, and 4 are currently tough to read, and the plots make it very difficult to understand what is happening. This could be illustrated more clearly. Right now, it is very difficult to understand how the performance is changing and what to look at.
- We have updated the figures to make them easier to understand.
7. Along the same lines, where do the error bars in the figures come from? Moreover, if you have them there, can you also introduce error bars in Table 1?
- The errors are statistical-only, derived from the testing dataset. We have added this statement to the paper and added the corresponding uncertainties to the table values.
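For example, for an efficiency ε measured on N test events, one standard estimate of such a statistical uncertainty is the binomial form below; this is shown only for illustration and is an assumption about the form, not a quotation from the paper:

```latex
\sigma_{\varepsilon} = \sqrt{\frac{\varepsilon\,(1-\varepsilon)}{N_{\mathrm{test}}}}
```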
Current status:
Reports on this Submission
Report
Thanks for the update. Overall I am happy with the modifications. There are, however, still two aspects which I would ask to fix in another minor revision before publication. See below.
Requested changes
1. Thanks for adding the loss function! I think we are getting closer, but the current indexing could still be clarified a bit. For example, you have an object P_i that you sum over in the definition of c(P), and then you continue summing over indices j,k, but it is not clear where those come from. Similarly, I think the P_i in c(P) and the P_j in v(P) do not refer to the same kind of object, which might cause confusion.
I suspect the root of the issue is that P actually has two types of indices: one for dimensions (running from 1,…,d, as in v(P)) and one for the vector count (from 1,…,n). It would be great to introduce these indices explicitly, maybe using i,j for dimensions and a,b for indexing the different vectors, and then structure the summation accordingly.
Interestingly, the original paper seems to use similarly loose notation, so there is a real opportunity here to clarify things once and for all with a more precise formulation. That would be a valuable contribution!
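For concreteness, one possible way to write the two terms with explicit indices, assuming P is treated as a set of n projected vectors P_a in R^d with components P_{a,i}, and assuming a VICReg-style variance/covariance structure for these terms. The prefactors, threshold γ, and regulator ε are placeholders that should of course be matched to your actual definitions:

```latex
% i, j = 1,...,d index dimensions; a = 1,...,n indexes the projected vectors
v(P) = \frac{1}{d} \sum_{i=1}^{d}
       \max\!\left(0,\; \gamma - \sqrt{\operatorname{Var}_a\!\left(P_{a,i}\right) + \epsilon}\right),
\qquad
c(P) = \frac{1}{d} \sum_{\substack{i,j=1 \\ i \neq j}}^{d} \left[C(P)\right]_{i,j}^{2},
\qquad
C(P)_{i,j} = \frac{1}{n-1} \sum_{a=1}^{n}
             \left(P_{a,i} - \bar{P}_{i}\right)\left(P_{a,j} - \bar{P}_{j}\right)
```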
2. One more thing: I still find Figures 2–4 quite hard to read. Would it be possible to increase the label and legend font sizes a bit? It would really help to be able to read the plots without needing to zoom in too much.
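For instance, something along the lines of the following matplotlib settings would already help; the specific values are only a suggestion:

```python
# Illustrative matplotlib settings to enlarge axis labels, ticks, and legends.
import matplotlib.pyplot as plt

plt.rcParams.update({
    "axes.labelsize": 16,
    "xtick.labelsize": 14,
    "ytick.labelsize": 14,
    "legend.fontsize": 14,
})
```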
Recommendation
Ask for minor revision