SciPost Submission Page
Consistent, multidimensional differential histogramming and summary statistics with YODA 2
by Andy Buckley, Matthew Filipovich, Christian Gutschow, Nick Rozinsky, Simon Thor, Yoran Yeh, Jamie Yellen
This is not the latest submitted version.
Submission summary
Authors (as registered SciPost users): | Andy Buckley · Christian Gutschow |
Submission information | |
---|---|
Preprint Link: | https://arxiv.org/abs/2312.15070v1 (pdf) |
Code repository: | https://gitlab.com/hepcedar/yoda |
Date submitted: | 2024-01-11 16:23 |
Submitted by: | Gutschow, Christian |
Submitted to: | SciPost Physics Codebases |
Ontological classification | |
---|---|
Academic field: | Physics |
Specialties: | |
Approaches: | Computational, Phenomenological |
Abstract
Histogramming is often taken for granted, but the power and compactness of partially aggregated, multidimensional summary statistics, and their fundamental connection to differential and integral calculus make them formidable statistical objects, especially when very large data volumes are involved. But expressing these concepts robustly and efficiently in high-dimensional parameter spaces and for large data samples is a highly non-trivial challenge -- doubly so if the resulting library is to remain usable by scientists as opposed to software engineers. In this paper we summarise the core principles required for consistent generalised histogramming, and use them to motivate the design principles and implementation mechanics of the re-engineered YODA histogramming library, a key component of physics data-model comparison and statistical interpretation in collider physics.
Reports on this Submission
Report #1 by Giordon Stark (Referee 1) on 2024-4-2 (Invited Report)
Strengths
1. This is a well-written paper describing the evolution of the YODA package into a new version, on which Rivet depends directly.
2. There is a strong description of the statistical machinery in the beginning which helps ground the work presented by the authors to the mathematics underlying histograms.
3. It's really nice to have a section highlighting "experiences learned" from previous mistakes. It's very important that this portion makes it through the editorial phase and stays.
Weaknesses
1. Section 5.6 is rather short and thin. Part of the authors' claim here is that usability is crucial, and a Python interface is going to be what makes a software package usable.
2. On pages 7/8, the design principles are nice to have, but I feel the names assigned to each block are a bit too vague and abstract. "Continuous aggregation", for example, does not immediately convey what it means. The authors would probably find it best to provide counterexamples or situations in which these design principles do not hold in other existing libraries. Matplotlib is a prominent example of not allowing the so-called "continuous aggregation", whereas ROOT allows it. So perhaps try to highlight that you want the usability of Matplotlib but the functionality of ROOT, or similar. I think more care simply needs to be taken here.
3. The last two paragraphs in section 3 are unneeded, and quite strongly opinionated. In fact, I think it's unrealistic and unfair to say that "boost.histogram" is less used (without giving context, or numbers or statistics). One could claim that YODA is not used outside of particle physics.
Report
I recommend that the authors submit this instead to SciPost CodeBases (rather than Phys) given the paper is somewhat more focused on the technicalities of the code base, rather than demonstrating a novel physics result.
Requested changes
1. The authors should discuss the overlap with the existing boost-histogram Python bindings, as well as the user-friendly "hist" package: https://hist.readthedocs.io/
2. I would appreciate a discussion or acknowledgement of the Array API (https://data-apis.org/array-api/latest/) and, more specifically, the Unified Histogram Interface (https://uhi.readthedocs.io/en/latest/). Section 6 presents a kind of indexing that differs somewhat from the standard. Incorporating these would, I think, allow YODA to coexist with other similar histogramming libraries and allow for translation between different tools. It should also improve the usability of YODA outside of HEP.
3. In Section 6, I was expecting some discussion of the underlying storage of the objects (contiguous in memory or not?) and how the authors ensure the histogram lookups are O(1).
4. Sections 6 and 7 should be merged and reorganized. I expect discussion of the plaintext YAML format to be under "serialization" rather than under "data exchange", whose meaning isn't clear to me. I also expect the CLI to have its own subsection highlighting some of the nice functionality that comes out of the box, without turning it into technical documentation.
5. Bottom of page 9 has a code block that should be a listings block with a reference number. This code block is also not really valid as "ao" and "script_generator" are not defined in this scope.
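To make requested change 3 concrete, the following is a minimal sketch of the kind of storage and lookup discussion being asked for. It is purely illustrative and is not taken from YODA: the class names `UniformAxis`, `VariableAxis` and `FlatHist2D` are hypothetical, and the layout (one underflow and one overflow slot per axis, row-major flattening into a single list) is an assumption chosen for simplicity. It shows that an equal-width axis admits an arithmetic O(1) bin lookup, whereas arbitrary bin edges generally need a binary search, which is O(log n).

```python
from bisect import bisect_right


class UniformAxis:
    """Hypothetical equal-width axis: lookup is pure arithmetic, so O(1)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi

    def index(self, x):
        # Slot 0 is underflow, slots 1..nbins are in range, nbins+1 is overflow.
        if x < self.lo:
            return 0
        if x >= self.hi:
            return self.nbins + 1
        i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
        return min(i, self.nbins - 1) + 1  # guard against floating-point round-up


class VariableAxis:
    """Hypothetical axis with arbitrary edges: binary search, O(log n)."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        # bisect_right counts the edges <= x: 0 means underflow,
        # nbins+1 means overflow, anything in between is the bin slot.
        return bisect_right(self.edges, x)


class FlatHist2D:
    """Sums of weights held in one flat list, flattened row-major."""

    def __init__(self, xaxis, yaxis):
        self.xaxis, self.yaxis = xaxis, yaxis
        self.stride = yaxis.nbins + 2  # +2 for the under/overflow slots
        self.sumw = [0.0] * ((xaxis.nbins + 2) * self.stride)

    def flat_index(self, x, y):
        # Combine the per-axis slots into a single offset into self.sumw.
        return self.xaxis.index(x) * self.stride + self.yaxis.index(y)

    def fill(self, x, y, weight=1.0):
        self.sumw[self.flat_index(x, y)] += weight


hist = FlatHist2D(UniformAxis(4, 0.0, 1.0), VariableAxis([0.0, 1.0, 3.0, 10.0]))
hist.fill(0.30, 2.0)
hist.fill(0.30, 2.5, weight=0.5)
hist.fill(-1.0, 50.0)  # lands in the x-underflow, y-overflow slot
```

In this sketch the combined lookup cost is set by the slowest axis, so a statement that lookups are O(1) would need to say whether it covers variable-width binning or only the uniform case.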