SciPost Submission Page
Agents of Discovery
by Sascha Diefenbacher, Anna Hallin, Gregor Kasieczka, Michael Krämer, Anne Lauscher, Tim Lukas
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Anna Hallin · Michael Krämer |
| Preprint Link: | scipost_202510_00054v1 (pdf) |
| Code repository: | https://github.com/uhh-pd-ml/AgentsOfDiscovery |
| Date submitted: | Oct. 29, 2025, 11:44 a.m. |
| Submitted by: | Anna Hallin |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Experimental, Computational, Phenomenological |
Abstract
The substantial data volumes encountered in modern particle physics and other domains of fundamental physics research allow (and require) the use of increasingly complex data analysis tools and workflows. While the use of machine learning (ML) tools for data analysis has recently proliferated, these tools are typically special-purpose algorithms that rely, for example, on encoded physics knowledge to reach optimal performance. In this work, we investigate a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents -- instances of LLMs with specific subtasks -- that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries (including ML systems) and by building on results of previous iterations. If successful, such agent-based systems could be deployed to automate routine analysis components to counteract the increasing complexity of modern tool chains. To investigate the capabilities of current-generation commercial LLMs, we consider the task of anomaly detection via the publicly available and highly-studied LHC Olympics dataset. Several current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated and their stability tested. Overall, we observe the capacity of the agent-based system to solve this data analysis problem. The best agent-created solutions mirror the performance of human state-of-the-art results.
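For readers unfamiliar with the agent pattern sketched in the abstract, the following is a minimal, purely illustrative sketch (not the authors' implementation, which is available in the linked code repository). It assumes the OpenAI Python SDK with an API key in the environment; the model name, prompts, and fixed iteration budget are placeholders chosen for illustration only.

```python
# Illustrative sketch of a two-agent loop: a "researcher" keeps one growing
# conversation and delegates coding subtasks to "coder" instances that each
# start from a clean context. All prompts and the model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(model: str, messages: list[dict]) -> str:
    """Send one chat-completion request and return the assistant's reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


# The researcher accumulates context across iterations.
researcher_history = [
    {"role": "system", "content": "You plan an anomaly-detection analysis and issue concrete coding subtasks."},
    {"role": "user", "content": "Task: search for a resonance in the dijet invariant-mass spectrum."},
]

for iteration in range(3):  # small fixed budget, for illustration only
    plan = ask("gpt-4.1", researcher_history)
    researcher_history.append({"role": "assistant", "content": plan})

    # Each coder starts fresh, seeing only the current subtask.
    coder_messages = [
        {"role": "system", "content": "Write self-contained Python implementing the requested analysis step."},
        {"role": "user", "content": plan},
    ]
    code = ask("gpt-4.1", coder_messages)

    # In a full system the generated code would be executed in a sandbox and
    # its output (stdout, plots, errors) fed back; here it is simply echoed.
    researcher_history.append(
        {"role": "user", "content": f"The coder returned:\n{code}\nRevise the plan or conclude."}
    )
```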
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Report #2 by Jonathon M. Langford (Referee 2) on 2026-1-2 (Invited Report)
Strengths
1) In-depth investigation into the use of agentic AI for particle physics analyses. Covers four different models from OpenAI, and compares stability, reliability and performance across many different configurations.
2) Good use of the LHC Olympics test problem to allow comparisons with teams of humans.
3) Clear explanation of the setup, and the authors provide code for reproducibility.
Weaknesses
1) Simplistic analysis, with little discussion of how this could be expanded to cover more complex analysis techniques (calibration, statistical modeling, etc.).
2) Limited by resources, as GPT-4.1 was used instead of the best-performing GPT-5 when comparing the performance of different prompts, etc.
Report
The paper is of high interest as it is the first thorough investigation into the use of agentic AI for LHC analyses. The methodology seems technically sound in all aspects and is presented in a clear way throughout. The use of language is excellent, and I found few to no spelling/grammar mistakes. Therefore, I highly recommend this paper for publication in SciPost, subject to the minor revisions/clarifications listed in the attached document. Thank you for the interesting work.
Requested changes
See attached document.
Recommendation
Ask for minor revision
Report #1 by Gordon Watts (Referee 1) on 2026-1-1 (Invited Report)
Strengths
- A multi-agent system that shows how well agents can work together to solve a problem directly applicable to particle physics. I'm not aware of this having appeared in a paper previously.
- Comprehensive description of a system that starts with a fairly general problem and prompt. Well written, and accompanied by source code that is also well written.
- Opens the door to a future where we really change how we do these complex analyses - there is a lot of work to do - but this is a proof of principle that I suspect we will be building on for a while.
Weaknesses
- Perhaps the biggest problem is that we are not yet used to writing papers about AI used like this, at least not in this field. Setting up a standard for what is rigor, etc., is yet to come. The very minimum is what you do - running repeatedly to understand if it can complete the task (even if you could have totally controlled the seed, I would have requested this - or slight variations in the question wording, etc.). This isn't really a weakness of the paper, but of the field. And it is nice to see this getting started (this needs no response from the authors).
- I would have liked to see some performance numbers for open-source models (DeepSeek, Qwen). OpenRouter provides OpenAI-compatible endpoints that one can use. This is a scoping choice, and of course up to the authors.
- In Figure 3 you have a number of outliers when it comes to significance. I'm sure it already took a great deal of prompt and infrastructure tuning to get the models to do this work - it would be interesting to know what further work is needed to make high-SIC or high-resolution results more common.
Report
Requested changes
The following are suggested changes:
- Page 1, first paragraph: there is nothing that gives a sense of scale other than "it is big" and "complex". There must be a reference somewhere that lays out what a modern-day precision analysis at the LHC requires - perhaps a CHEP talk from the last 5-10 years?
- Page 3, "A p-value associated..." - do you need a reference for p-value?
- In the last paragraph of 2.1 you make no reference to the full mass range - which you actually do use later on. This got me quite confused ("we aren't going to do this... oh - wait - here it is!"). You might introduce the full breadth of this paper's work here. This was triggered by the first "full-mass-range" setup mentioned at the top of page 4.
- Section 2.2, first paragraph... I'm not sure - because it wasn't explicitly mentioned - but was any pre-processing done to the LHCO data? The reason I got confused here is that I haven't worked with the LHCO datasets. So, it might be worth mentioning that you did no calculations or further feature engineering on the data (which, after spending some time looking this up, is I think what happened). This feels important because it means the paper is addressing a "real" challenge, not something contrived or made easier.
- Last paragraph on page 3: the term "mixed data" wasn't totally clear to me. In the previous paragraph you talk about the "R&D dataset". You also use the term "signal-region setup" - is that just "the signal region is defined as..."? Also, "setup" is used as a noun here. I wonder if one could rewrite some of this to untangle the set of terms, to make it easy for the reader to understand what data samples are being used for what.
- Figure 1 - I think my trouble with this boils down to the fact that control flow and tool access both use the same type of arrows (solid, black). As a result there are weird bi-directional arrows (flowing out, but going to two places, and it isn't clear why you would choose one or the other - "Initial Task" -> "Task, Researcher", and then "write_python" -> "linting error" and "Local Machine"). I think I get it, but this could be made a lot more straightforward by using dashed arrows for information flow and solid for control flow (or whatever you choose as best!). This started because I thought you said "Coder" doesn't do code execution, but there is a link to the large yellow "Machine" box, which contains code execution... from there I figured out these were services, which could be used or not depending on the situation.
- You talk about a single instance of the researcher and multiple instances of the coder (section 3.2.1). I think what you mean is that the researcher gets one prompt context and keeps adding to it, while new instances of the coder are identical except that they start with a clean context. It would be helpful if this was made clear.
- This was caused by the line "one researcher can employ several different coders" - but there is only one coder box - is one good for Python, and another for C++, etc.?
- You use footnotes for a number of things - like on page 6 - but these strike me as quite important. Perhaps they should be lifted out of the footnotes and into the text, with an extra paragraph or a few sentences?
- At the end of 3.3 you mention that GPT-5 can change its thinking. For the API, which you use, I don't think this is correct: you have to specify a thinking level (low, medium, high). The ChatGPT app automatically does this - it starts with a low level and then boosts it if the answer requires more effort - but with the API that has to be implemented by you.
- Figure 3 - as you note later on, I'm not sure how informative the timing plots are for the LLM requests - it depends a lot on load and provisioning - something that we have no control over. I suppose the relative differences are interesting, but that plot needs disclaimers.
- 4.2.1 - the phrase "(that would indicate a tension between the data and the background-only hypothesis)" - this is an intro to p-values... I know you don't want to talk about this - that topic is... fraught... but perhaps add a reference or something earlier on, when the p-value is first introduced?
- First paragraph, page 15 - "could be aware" - they certainly are. And, actually, it would be good to record the knowledge cut-off date for these models. Some of the cut-offs are, I think, after LHCO - so the models may have LHCO code etc. in their training datasets. As an example, I asked GPT-5.2 (more recent than the models you used) in non-web mode and it knew about LHCO.
- Section 4.2.3 - I'm not sure what to say here - generalizing prompt engineering between different models isn't obviously possible. For example, OpenAI's claims were that 5 was much better at instruction following than 4.1 - so with 5 there is no need to threaten end-of-world consequences. So the conclusions are certainly true for 4.1, but generalizing them may not make sense and probably has to be tested.
- Section 4.2.4 - "In this section..." but the section is one paragraph long... Perhaps just start with a new paragraph here, and "We test..."
- Page 22, last paragraph, "A natural follow-up is the question..." - it is an interesting question to understand whether we are actually pushing these models to the limit and need a new, more advanced model, or whether we need a more advanced way of working with the current models we have (I currently tend towards the latter - if all model development stopped right now, we'd still have a few years of work before we were really exploiting these models to their full effect).
- The GitHub repo didn't seem to have the Dockerfiles used to create the images you were running with Singularity. Please add them.
- It might be good to package up the complete repo and upload it to Zenodo.
- Since I couldn't see the Dockerfile, I couldn't figure out what tools (software packages) were used. If you are using any that contain citation files or references, please make sure they are cited in this paper. Please also include the Dockerfile(s).
Recommendation
Publish (easily meets expectations and criteria for this Journal; among top 50%)
