SciPost Submission Page
Exploring unsupervised top tagging using Bayesian inference
by Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago A. Tanco, Tatiana Tarutina
This is not the latest submitted version.
This Submission thread is now published.
Submission summary
Authors (as registered SciPost users):  Ezequiel Alvarez · Santiago Tanco 
Submission information  

Preprint Link:  scipost_202301_00014v1 (pdf) 
Date submitted:  2023-01-10 21:15 
Submitted by:  Tanco, Santiago 
Submitted to:  SciPost Physics Core 
Ontological classification  

Academic field:  Physics 
Specialties: 

Abstract
Recognizing hadronically decaying top-quark jets in a sample of jets, or even their total fraction in the sample, is an important step in many LHC searches for both Standard Model and Beyond the Standard Model physics. Although outstanding top-tagger algorithms exist, their construction and their expected performance rely on Monte Carlo simulations, which may induce potential biases. For these reasons we develop two simple unsupervised top-tagger algorithms based on performing Bayesian inference on a mixture model. In one of them we use as the observed variable a new geometrically based observable $\tilde{A}_{3}$, and in the other we consider the more traditional $\tau_{3}/\tau_{2}$ $N$-subjettiness ratio, which yields a better performance. As expected, we find that the unsupervised tagger performance is below that of existing supervised taggers, reaching an expected Area Under Curve (AUC) of $\sim 0.80$--$0.81$ and accuracies of about 69\%--75\% over the full range of sample purity. However, these performances are more robust to possible biases in the Monte Carlo than their supervised counterparts. Our findings are a step towards exploring and considering simpler and unbiased taggers.
Current status:
Reports on this Submission
Report 2 by Tilman Plehn on 2023-02-06 (Invited Report)
 Cite as: Tilman Plehn, Report on arXiv:scipost_202301_00014v1, delivered 2023-02-06, doi: 10.21468/SciPost.Report.6685
Strengths
 Resilience of subjet taggers is one of the key questions, and independent of the question if this tagger will eventually be used, it is important to test and publish ideas which might eventually help.
Weaknesses
 Concerning the general case: I am not at all convinced that top taggers need to be trained on MC, because there is data to train on, and then MC to check. For top-tagging this strategy is self-inflicted damage by the experiments.
 Looking at the method: zeta appears between Eq.(1) and Eq.(3), and I am not sure I get the logic of k vs zeta, and classification vs regression. Please stick to one problem in all formulas.
 Can you put your Bayesian method into ML context, like the conditional flow (cINN) or BayesFlow inference. Generative models for Bayesian inference are popping up everywhere these days.
 Related to the ML methods, how about extending this method to more than one dimension?
 Sorry, to understand the Fourier philosophy, what happens beyond angles, for instance the constituent energies?
 I am not sure I understand the QCD toy model, should that not be a peak at small angles? Maybe I am not getting it; if that is the case, I apologize in advance.
 Please introduce the SVI a little more carefully, since it is key to the method.
 Moving to the results: for the N-subjettiness model, why a Gamma distribution?
 Most importantly, should that method not work for more than one observable? We know that single-observable taggers are not good. Except for maybe Softdrop, why not use that one? Or jet mass correlated with N-subjettiness?
 Why do the authors believe their method does not introduce a bias? What about the choice of shape functions? Maybe try two and see what happens for one of their cases?
 A few comments on the references: why cite the Higgs discovery in a top tagging paper?
 I agree that top physics is a great future program, but please give some motivation.
 In the introduction, please give credit to BDRS for traditional subjet tagging, and please cite the original HEPTopTagger paper, replacing the later [14]?
 For the subjet taggers, please mention the original jet images paper.
 Please cite the MLtaggers compared in Ref.[18], to help the original authors of the different ML top taggers. It's not that many...
 Finally, in the conclusions I do not understand what the sentence `it could have very challenging and promising approaches' on p.10 means.
Report
As mentioned above, the paper is not super-brilliant, but it is more than interesting enough to publish in SciPost Core. My comments are mostly concerned with the presentation, but maybe the authors could think about and discuss the physics questions as well.
Requested changes
See the weaknesses, they can all be fixed easily.
Anonymous Report 1 on 2023-02-03 (Invited Report)
 Cite as: Anonymous, Report on arXiv:scipost_202301_00014v1, delivered 2023-02-03, doi: 10.21468/SciPost.Report.6668
Report
In this paper, the authors have studied how unsupervised learning methods can be used to distinguish boosted top quarks from QCD jets. They claim that their approach is robust against possible Monte Carlo bias and that, for this reason, the method should be helpful in real data analysis. The concept seems interesting. However, I would like to ask a few questions before acceptance.
Questions and comments:
1. They have discussed the possible bias in the Monte Carlo generator and cited two papers. However, a more detailed discussion should be included in the paper. It would be nice if some quantitative statements could be made.
2. If we apply this method to actual data, quark jets contaminate the dataset. The quark fraction will depend on the p_T range of the jet. How does the inclusion of quark jets in the dataset change the outcome?
3. In principle, this method can also induce bias. Is it possible to estimate the bias and conclude that the bias is negligible compared to the bias of supervised methods?
4. In this paper, the authors used the n-subjettiness ratio variable tau_3/tau_2. Is it possible to include more variables in this analysis? Supervised top-taggers use multiple variables, and correlations among variables help to improve the accuracy.
5. For comparison, it would be good to provide the ROC using the same variable (tau_3/tau_2) using labeled dataset information.
6. In this paper, the authors consider only one p_T bin. How the results vary as we change the energy bin (say, 1 TeV) is unclear.
Author: Santiago Tanco on 2023-03-14 [id 3479]
(in reply to Report 1 on 2023-02-03)
We thank the reviewer for the questions and comments which we believe have improved the present work. We have made explicit all relevant changes in magenta. To answer the reviewer's comments:
They have discussed the possible bias in the Monte Carlo generator and cited two papers. However, a detailed discussion should be there in the paper. It would be nice if some quantitative statements could be made.
We have expanded the discussion on page 2 on possible biases in the Monte Carlo generators, with special emphasis on how they could affect Top-quark measurements. We have not provided numbers for these biases, as these would require a detailed study and would not be straightforwardly comparable to any metric derived from our method.
If we apply this method to actual data, quark jets contaminate the dataset. The quark fraction will depend on the p_T range of the jet. How does the inclusion of quark jets in the dataset change the outcome?
The reviewer is correct in that the quark composition depends on the $p_T$. In this case, the quark jet contribution is already contained in the QCD sample and thus would not change anything for the narrow $p_T$ bin which was used to generate the dataset. We have modified the text on page 2 to make explicit the fact that QCD contains both gluons and quarks. The $p_{T}$ dependence of the observables can be taken into account by binning in $p_T$ and training different models for each bin or by incorporating $p_T$ as another variable and modifying the probabilistic model accordingly. This is feasible if a good guess for the $p_T$ dependence of the relevant variables is known.
In principle, this method can also induce bias. Is it possible to estimate the bias and conclude that the bias is negligible compared to the bias of supervised methods?
The reviewer is correct in that the method also induces bias. In our case, the bias resides in the choice of the class-dependent PDFs. In a semi-supervised framework, simulations can be considered to check whether the obtained probability distributions match the desired probability distributions. This is what we did for N-subjettiness, and we found that the choice of Gamma induces the wrong shape for the QCD distribution. In a fully unsupervised case, other metrics such as the total Kullback-Leibler divergence, or more powerful checks such as a posterior predictive check, could be implemented. This series of checks is part of Bayesian model building and, in contrast to traditional supervised analyses, is easily interpretable. We do not claim that our method as presented here is currently better than a supervised analysis, as the proposed functional forms do bias the results, but we claim that it is a way forward for more robust taggers that are simulation-independent. We have clarified this in the main text and added posterior predictive checks to emphasize that, although biased, the model is still accomplishing the density estimation task it is designed to do. The motivation and definition of this check can be found on pages 5-6, while the results are given on pages 10 and 11 for the different mixture models.
In this paper, the authors used the nsubjettiness ratio variable tau_3/tau_2. Is it possible to include more variables in this analysis? Supervised toptaggers use multiple variables, and correlations among variables help to improve the accuracy.
The reviewer is correct in that a multidimensional input is a key feature of the powerful top taggers. In our case, including multiple variables is equivalent to specifying a multidimensional distribution with a particular correlation structure. If this can be done either by choosing independent variables or by an accurate modelling of the multidimensional distribution, multiple variables can be included.
For comparison, it would be good to provide the ROC using the same variable (tau3/tau_2) using labeled dataset information.
We have changed Fig. 4b in order to show the ROC calculated using the labeled dataset, instead of the fitted probability density functions. Nonetheless, these ROC curves are very similar and this change is hardly noticeable in the plot.
In this paper, the authors consider only one p_T bin. How the results vary as we change the energy bin (say, 1 TeV) is unclear.
As we mentioned in the previous response, the reviewer is correct in that the results are $p_T$ dependent. Although we have not explored it, at higher $p_T$ the taggers will have a degradation in performance as Top and QCD jets become more alike and the observables themselves lose discrimination power. We have added a footnote clarifying the $p_T$ dependence of the taggers when we introduce the "Top Quark Tagging Reference Dataset" in Section 2.
Author: Santiago Tanco on 2023-03-14 [id 3480]
(in reply to Report 2 by Tilman Plehn on 2023-02-06)
We thank Tilman Plehn for the very detailed review with questions and comments which we believe have improved the present work. We have made explicit all relevant changes in magenta. To answer the different comments:
We agree. We consider the Top Quark Tagger dataset only for benchmarking but the aim of this method is to be applied directly on data.
The notation was selected to separate between class membership and model parameters in $p(x|\zeta)$. Here $k$ is the class index and $\zeta$ the model parameters, which are random variables in the Bayesian framework. $\zeta$ includes both the binomial probability parameter $\pi_{k}$ and the class-dependent parameters $\zeta_{k}$. We build a classifier out of the probability of a given class assignment $C_k$ given the data and the parameters, $p(C_k|x,\zeta)$. There is no regression in this setup, only density estimation for classification.
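As a concrete illustration of this setup, here is a minimal sketch of how such a class posterior $p(C_k|x,\zeta)$ follows from Bayes' theorem on a two-component mixture. The Gamma parameter values below are made up for demonstration; they are not the fitted values from the paper.

```python
import numpy as np
from scipy.stats import gamma

# Illustrative two-class mixture: mixing probabilities pi_k and
# class-dependent Gamma shape functions p(x | C_k, zeta_k).
# All parameter values here are assumptions for demonstration only.
pi = np.array([0.3, 0.7])                        # P(C_k), k = signal, background
shapes = np.array([4.0, 2.0])
scales = np.array([0.1, 0.3])

def class_posterior(x):
    """p(C_k | x, zeta) = pi_k p(x|C_k) / sum_j pi_j p(x|C_j)."""
    x = np.atleast_1d(x)
    # likelihoods, shape (n_points, n_classes)
    lik = np.stack([gamma.pdf(x, a=a, scale=s)
                    for a, s in zip(shapes, scales)], axis=1)
    joint = pi * lik
    return joint / joint.sum(axis=1, keepdims=True)

post = class_posterior([0.2, 0.6, 1.0])
assert np.allclose(post.sum(axis=1), 1.0)        # posteriors sum to one
```

Thresholding this posterior (e.g. assigning the class with the larger $p(C_k|x,\zeta)$) then yields the density-estimation-based classifier described above.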
We have added a bit of context on page 3, comparing our method to other generative processes in the literature. Because the literature is so vast, we have mentioned only a subset of the relevant HEP papers to show other approaches and their differences with ours. Mainly, we show how we are aiming to capture the multimodality of the data in a way that can be matched to true physical processes and which relies on simulations as little as possible. We have not included a BayesFlow reference because we believe the other references included also deal with learning surrogate models with Invertible Networks and are directly applied to High Energy Physics.
To extend the method to more dimensions we would need to specify a multidimensional distribution with a particular correlation structure. As we mention in Section 5, we found that the additional bias can offset the benefit of including more variables. However, as with selecting better onedimensional distributions, there are possible ways forward by going to more sophisticated probabilistic models.
This is a very apt observation. In fact, one could consider how much energy is deposited in each {\it slice} and extract the information using these values to compute the Discrete Fourier Transform. We have computed this alternative and we encounter the same problem: the true distributions for signal and background do not separate enough. The overall performance is not better when using energies instead of counting tracks. We have added this information to the manuscript on page 10.
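To make the slicing construction concrete, the following sketch (our own illustration: the number of slices, the normalization, and the toy events are assumptions, not the paper's exact definitions) histograms tracks into angular slices around the jet axis and takes the discrete Fourier transform of the per-slice totals; passing per-track energies as `weights` gives the energy-deposition variant discussed above.

```python
import numpy as np

def slice_fourier(phi, weights=None, n_slices=16):
    """Bin track angles into slices and return the normalized magnitudes
    of the DFT of the per-slice totals. weights=None counts tracks;
    passing energies sums the energy deposited in each slice instead."""
    counts, _ = np.histogram(phi, bins=n_slices, range=(-np.pi, np.pi),
                             weights=weights)
    return np.abs(np.fft.fft(counts)) / max(counts.sum(), 1)

rng = np.random.default_rng(0)
# Toy three-pronged event: tracks clustered around three angles,
# as a caricature of a hadronic top decay.
prongs = rng.normal(loc=[-2.0, 0.0, 2.0], scale=0.1, size=(200, 3)).ravel()
# Toy isotropic event: tracks spread uniformly over the slices.
uniform = rng.uniform(-np.pi, np.pi, 600)

a3_prongs = slice_fourier(prongs)[3]
a3_uniform = slice_fourier(uniform)[3]
# The three-pronged topology activates the third harmonic much more
# strongly than the angularly uniform one.
```

This also illustrates the reply to the toy-model question below: a distribution that populates all slices evenly, however close its tracks are to the jet center, gives no special activation of the third Fourier component.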
If we understand this question correctly, the Referee observes that in the toy model QCD is modeled as a Normal distribution around the center of the jet, and hence it populates the small angles near the center. In the Fourier analysis we take slices whose corners are at the center of the jet. Therefore in this case, with all tracks near the center, we expect all of the slices to be fairly equally populated in tracks: what matters is not whether the tracks are near or far from the center, but rather their distribution in the angle sweeping the (theta, phi) plane. Therefore, this distribution should not provide a special activation of the $\tilde A_3$ component.
We have added a more detailed description of SVI on pages 4 and 5.
Ideally, the shape function choice would be completely justified from first principles, much like Softdrop multiplicity for quark/gluon tagging. However, when this is not the case, simplicity is perhaps the guiding principle. In this sense, we first considered a Beta distribution, because this ratio of N-subjettiness observables should be a random number in the [0,1] range. However, because we are dealing with an approximate computation of N-subjettiness, we selected the Gamma distribution, which can accommodate suboptimally computed N-subjettiness values close to or larger than 1 while still being fairly close to a Beta distribution. In the end, this is an arbitrary choice, which is reflected in the resulting bias and could be improved. We have added a sentence clarifying this methodology on page 11.
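The support argument can be seen directly in a small sketch (with synthetic data standing in for approximately computed tau_3/tau_2 values; the distribution generating them is an assumption for illustration): a Beta density vanishes for values above 1, while a fitted Gamma still assigns them finite density.

```python
import numpy as np
from scipy.stats import beta, gamma

rng = np.random.default_rng(1)
# Synthetic stand-in for approximately computed tau_3/tau_2 values:
# mostly inside [0, 1], with a small tail spilling past 1.
ratios = np.clip(rng.normal(0.65, 0.18, 5000), 0.01, None)

# A Beta density is zero outside its [0, 1] support ...
assert beta.pdf(1.05, a=2.0, b=2.0) == 0.0

# ... while a Gamma fitted to the same data (location fixed at 0)
# assigns finite density to values at or slightly above 1.
a_hat, _, scale_hat = gamma.fit(ratios, floc=0)
assert gamma.pdf(1.05, a=a_hat, scale=scale_hat) > 0.0
```

In other words, any event with a suboptimally computed ratio above 1 would receive zero likelihood under a Beta shape function, which is what motivates the Gamma choice despite the bias it introduces.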
We did not consider Softdrop as an initial variable because we found that the N-subjettiness distribution and its differences between Top and QCD could be chosen less arbitrarily. However, our method could easily accommodate Softdrop once a shape function choice is made, especially because we found our choice of shape functions for N-subjettiness to be imperfect. In regards to combining observables, and as discussed in the multidimensionality response, we did not incorporate multidimensional outputs because of the difficulty of modelling multidimensional correlations.
Our method does indeed introduce bias in the choice of the shape functions. This is why we cannot appropriately recover the QCD distribution. However, this bias can be verified explicitly within a Bayesian framework. We have added one such method, the posterior predictive check, to the main text to show how our method may be biased for N-subjettiness but is still able to recover the approximate total density distribution. A general description of the method was added on pages 5 and 6, while the results for the different mixture models are given on pages 10 and 11. As discussed in the multidimensionality response, incorporating more general shape functions with appropriate diagnostics can reduce the bias while still being more robust than MC taggers.
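As a hedged sketch of what a posterior predictive check involves (all distributions and parameter values below are illustrative; in the actual analysis the parameter draws would come from the fitted variational posterior, and the summary statistic from the density-estimation task):

```python
import numpy as np

rng = np.random.default_rng(2)
observed = rng.gamma(shape=3.0, scale=0.2, size=2000)   # stand-in dataset

# Stand-in for posterior draws of the Gamma parameters; assumed here
# to be narrow Normals around the "true" values for illustration.
post_shape = rng.normal(3.0, 0.1, 200)
post_scale = rng.normal(0.2, 0.01, 200)

# Replicate the dataset under each posterior draw and record a summary
# statistic (here the sample mean); compare to the observed statistic.
rep_means = np.array([rng.gamma(a, s, size=observed.size).mean()
                      for a, s in zip(post_shape, post_scale)])
obs_mean = observed.mean()

# Posterior predictive p-value: fraction of replicas exceeding the data.
# A value far from 0 or 1 signals no gross mismatch for this statistic.
ppp = (rep_means > obs_mean).mean()
```

Repeating this for statistics sensitive to the individual components (rather than the total density) is what exposes a biased shape-function choice even when the overall fit looks acceptable.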
We included these references in the first sentence of the Introduction as a mere example to illustrate LHC success. However, it is true that in the context of the paper they may seem a bit offtopic, so we decided to remove them.
We have added some motivation on top physics in Section 1.
We modified the text to give credit to BDRS and we have cited the original HEPTopTagger papers. We didn't remove [14] as it is still a valuable reference.
We have added a reference to the original jet-images tagger.
We have added references to the original ML top taggers cited in [18].
We agree with the Referee that further discussion is needed after this sentence. We have added a few sentences which help to visualize the appeal and scope of the proposed idea.