
SciPost Submission Page

Consistent, multidimensional differential histogramming and summary statistics with YODA 2

by Andy Buckley, Louie Corpe, Matthew Filipovich, Christian Gutschow, Nick Rozinsky, Simon Thor, Yoran Yeh, Jamie Yellen

Submission summary

Authors (as registered SciPost users): Andy Buckley · Christian Gutschow
Submission information
Preprint Link: https://arxiv.org/abs/2312.15070v2
Code repository: https://gitlab.com/hepcedar/yoda/
Date submitted: 2024-04-26 17:37
Submitted by: Buckley, Andy
Submitted to: SciPost Physics Codebases
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Computational, Phenomenological

Abstract

Histogramming is often taken for granted, but the power and compactness of partially aggregated, multidimensional summary statistics, and their fundamental connection to differential and integral calculus make them formidable statistical objects, especially when very large data volumes are involved. But expressing these concepts robustly and efficiently in high-dimensional parameter spaces and for large data samples is a highly non-trivial challenge -- doubly so if the resulting library is to remain usable by scientists as opposed to software engineers. In this paper we summarise the core principles required for consistent generalised histogramming, and use them to motivate the design principles and implementation mechanics of the re-engineered YODA histogramming library, a key component of physics data-model comparison and statistical interpretation in collider physics.

Author comments upon resubmission

Thanks for the close reading, and we're glad that the explanations of the maths foundations and the design lessons learned were appreciated. We have updated the paper in response to the comments, and thanks for helping us make it clearer.

However, we disagree with some comments, particularly those that treat the Python interface and the Python sci-comp ecosystem of packages and standards as the relevant context. We explain our justifications for not implementing some requests below.

WEAKNESSES

  1. 5.6 is rather short and bereft. Part of the authors' claim here is that usability is crucial, and a Python interface is going to be what makes a software package usable.

The final statement seems a personal opinion. Much of HEP still needs histogramming to happen in compiled code, and YODA is firstly a C++ library for use in applications where C++ is user-facing, and usability is a goal. As the C++ interface is fairly clean, a fairly direct mapping of that to Python is applied and there is relatively little to be said about it.

Our previous experience was that it was best for user comprehension to make the Python wrapper as thin as possible, rather than implement lots of Pythonisations that effectively define a new API in Python and require users to maintain two mental models of how YODA objects are to be used. A comment to this effect has been added.

  1. On pages 7/8, the design principles are nice to have, but I feel like the names assigned to each block are a bit too vague and abstract. "Continuous aggregation" for example does not immediately remind me of what that means. The authors would probably find it best to provide counterexamples or situations in which these design principles do not hold in other existing libraries. Matplotlib is a prominent example of not allowing the so-called "continuous integration". Meanwhile, ROOT allows this. So perhaps trying to highlight that you want the usability of matplotlib but the functionality of ROOT, or similar. I think just more care needs to be taken here.

Matplotlib is not a histogramming package, but a plotting one: the exception is its thin wrapping of numpy.histogram to mimic Matlab's historic interface. This is the reason for Matlab and numpy already being the cited counterexamples.

ROOT as the most prominent (though not the original) exemplar of the continuous-aggregation mode is already discussed after the bullet list, and we've tweaked the wording in the definition-list a little. Note that continuous aggregation is unrelated to the continuous integration mentioned -- we assume this was just a typo.

  1. The last two paragraphs in section 3 are unneeded, and quite strongly opinionated. In fact, I think it's unrealistic and unfair to say that "boost.histogram" is less used (without giving context, or numbers or statistics). One could claim that YODA is not used outside of particle physics.

We have removed reference to Boost Histogram usage, as it's not particularly relevant and it's hard to evidence use as literature searches for utility packages just find the authors' writeups.

But we disagree that the remainder is opinionated: as the footnote highlights, the intention is to spell out the ecosystem and the reason for yet another histogramming package (although most of these decisions were made ~15 years ago with YODA v1!) rather than to criticise. We are not aware of any inaccurate statements made here about the comparable packages, and consider this section highly relevant to potential users trying to weigh the pros and cons of different solutions.

REQUESTS

  1. The authors should discuss the overlap with the existing boost-histogram python bindings, as well as the user-friendly "hist" package: https://hist.readthedocs.io/

Again, the focus on Python does not seem appropriate: YODA is a user-friendly C++ interface, which is then semi-trivially mapped into Python. We spend almost the entirety of the paper talking about the C++ design and implementation, and so the relevant primary comparators are to other C++ interfaces.

The boost-histogram C++ and Python bindings expose the same API, which implements general histogramming but without the user-facing syntactic sugar. Hist implements some UI refinements, but a) only in Python, and b) it is still rather verbose with respect to the HEP baseline of ROOT's API, and requires a fairly high level of Python stdlib fluency for some operations (e.g. functools.reduce(operator.mul, ...), used in its quickstart example). We do not think it is useful to spend time critiquing other packages' APIs in a different language, though.
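
For readers unfamiliar with the idiom named above, functools.reduce(operator.mul, ...) folds a sequence into a running product. The following is a minimal standard-library sketch of the idiom only: the name axis_bin_counts and its values are hypothetical, and are not taken from Hist's quickstart.

```python
import functools
import math
import operator

# Hypothetical bin counts along three histogram axes (illustrative values only)
axis_bin_counts = [10, 20, 5]

# reduce applies operator.mul pairwise, left to right: (10 * 20) * 5
total_bins = functools.reduce(operator.mul, axis_bin_counts)

# math.prod (Python 3.8+) expresses the same product more directly
assert total_bins == math.prod(axis_bin_counts)

print(total_bins)  # 1000
```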

  1. I would appreciate a discussion or acknowledgement of the Array API (https://data-apis.org/array-api/latest/) and more specifically, the Uniform Histogramming Interface (https://uhi.readthedocs.io/en/latest/). Section 6 provides a kind of indexing which is a bit different from the standard. Incorporating these, I think, will allow YODA to coexist with other similar histogramming libraries and allow for translation between different toolings. It should also allow for improved usage of YODA outside of HEP.

Again these are Python-specific: technical capabilities as well as ecosystem differences mean that such standards can't just be mapped across languages, though aspects of our API design were influenced by Python native capabilities.

Interface definitions like UHI are even more localised: it seems to have been developed and deployed entirely within the Python SciKit-HEP community, and is only "universal" within the scopes envisaged for those tools. From our reading, its plotting interface does not envisage plotting of asymmetric or systematic uncertainties, nor representing overflows or irregular tilings, and mixes up statistical moments with display choices (e.g. in prescribing what error bars in profile-type data objects are to mean).

So UHI looks a potentially interesting idea, though limited to Python rendering backends and a subset of functionality, and requiring a separation of value and axis objects which is at odds with YODA's evolved design. It's not clear at present how much value would be added by retro-fitting the "missing" UHI API elements, as the support does seem to be within closely related SciKit-HEP packages, but we will discuss with the UHI author(s).

In total, as there is currently such a mismatch, we feel it would not be appropriate to reference these, only to say that they are not obviously relevant.

  1. In Section 6, I was expecting some discussion of the underlying storage of the objects (contiguous in memory or not?) and how the authors ensure the histogram lookups are O(1).

The text on local/global indices has been improved, and a sentence added to confirm that the actual bin contents are stored contiguously in memory. Lookup of the content from either local or global indices is trivially O(1) given the storage, but this is now explicitly stated. Lookup from coordinates is not, and the strategies used to optimise that index computation are already discussed.
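
As a generic illustration of why lookup from indices is constant-time with contiguous storage, the sketch below maps per-axis (local) indices to a single flat (global) index. It assumes a row-major layout with no overflow bins; the function names are hypothetical, and YODA's actual index ordering and overflow handling may differ.

```python
from typing import Sequence, Tuple


def global_index(local: Sequence[int], shape: Sequence[int]) -> int:
    """Map per-axis (local) bin indices to one flat (global) index, row-major."""
    if len(local) != len(shape):
        raise ValueError("need exactly one local index per axis")
    index = 0
    for i, n in zip(local, shape):
        if not 0 <= i < n:
            raise IndexError("local index out of range")
        index = index * n + i  # accumulate strides one axis at a time
    return index


def local_indices(index: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Inverse mapping: recover the per-axis indices from the flat index."""
    out = []
    for n in reversed(shape):
        index, i = divmod(index, n)
        out.append(i)
    return tuple(reversed(out))


# Hypothetical 2D histogram with 4 x 3 bins held in one flat list
shape = (4, 3)
contents = [0.0] * (shape[0] * shape[1])

g = global_index((2, 1), shape)  # 2 * 3 + 1 = 7
contents[g] += 1.0               # direct access into the contiguous storage
```

The cost of either mapping depends only on the number of axes, not on the number of bins, so for a histogram of fixed dimensionality the lookup is O(1). Lookup from coordinates additionally needs a per-axis search for the bin containing each coordinate, which is why it is treated separately.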

  1. Sections 6 and 7 should be merged, and reorganized. I expect discussions of the plaintext YAML format to be in the "serialization" section rather than in "data exchange", whose meaning isn't clear to me. I also expect the CLI interface to have its own subsection to highlight some of the nice functionalities that come out of the box, but without making it a technical documentation.

We strongly disagree that these sections should be merged: data exchange and visualisation are completely different facets, both in principle and in implementation. Also, there seems to be a confusion between in-memory serialisation and file formats here: the two are distinct and hence discussed in distinct subsections, but within the common (and hence generically titled) section on data exchange.

We felt that it was more useful to note cognate CLI abilities where relevant to features, rather than to isolate them into a distinct section: we have highlighted all the CLI features of most interest for this level of document via these contextual notes, and a dedicated subsection would be both disjoint and very short.

  1. Bottom of page 9 has a code block that should be a listings block with a reference number. This code block is also not really valid as "ao" and "script_generator" are not defined in this scope.

We disagree that this should be a listings block: it's an inline example snippet, not something to refer back to. We have now indicated the meanings of the variables in the text, thanks for the note!

REPORT

I recommend that the authors submit this instead to SciPost CodeBases (rather than Phys) given the paper is somewhat more focused on the technicalities of the code base, rather than demonstrating a novel physics result.

The paper is submitted to SciPost Codebases.

List of changes

1. Added explanations of contiguous storage and global-index role
2. Added motivation of thin Python wrapper
3. Removed mention of Boost Histogram usage level, better explained relation to other packages

Current status:
In refereeing
