SciPost Submission Page
Large Language Models -- the Future of Fundamental Physics?
by Caroline Heneka, Florian Nieser, Ayodele Ore, Tilman Plehn, Daniel Schiller
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Ayodele Ore · Tilman Plehn · Daniel Schiller |
| Preprint Link: | scipost_202507_00080v1 (pdf) |
| Code repository: | https://github.com/heidelberg-hepml/L3M |
| Date submitted: | July 29, 2025, 2:47 p.m. |
| Submitted by: | Daniel Schiller |
| Submitted to: | SciPost Physics |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approach: | Computational |
Abstract
For many fundamental physics applications, transformers, as the state of the art in learning complex correlations, benefit from pretraining on quasi-out-of-domain data. The obvious question is whether we can exploit Large Language Models, requiring proper out-of-domain transfer learning. We show how the Qwen2.5 LLM can be used to analyze and generate SKA data, specifically 3D maps of the cosmological large-scale structure for a large part of the observable Universe. We combine the LLM with connector networks and show, for cosmological parameter regression and lightcone generation, that this Lightcone LLM (L3M) with Qwen2.5 weights outperforms standard initialization and compares favorably with dedicated networks of matching size.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Report
This manuscript presents a novel investigation into the application of pretrained Large Language Models (LLMs) for numerical tasks in fundamental physics. The authors adapt the Qwen2.5-0.5B model to analyze and generate simulated cosmological data from the Square Kilometre Array (SKA). The core methodology involves replacing the LLM's standard embedding layers with “connector networks” to interface with numerical physics data, effectively reprogramming the model for a new modality. The authors systematically compare the performance of their adapted LLM against several baselines and show that the out-of-domain pretraining on linguistic data provides a significant and consistent performance advantage in the context of a small dataset (~5000 samples in their experiments). Overall, the paper is well-structured and clearly written. I list below a few questions that should be addressed or clarified before publication:
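For orientation, a minimal sketch of what such a connector might look like in PyTorch. The module name, the two-layer MLP form, and the feature dimension are hypothetical and not the authors' implementation; the hidden size of 896 is that of Qwen2.5-0.5B.

```python
import torch
import torch.nn as nn

class NumericalConnector(nn.Module):
    """Maps numerical physics features to the LLM hidden dimension,
    taking the place of the token-embedding lookup."""
    def __init__(self, n_features: int, hidden_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, n_features) -> (batch, sequence, hidden_dim)
        return self.net(x)

# The connector output would be passed to the backbone as input embeddings
# (e.g. backbone(inputs_embeds=connector(x)) in Hugging Face transformers)
# instead of token ids.
```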
- The study is based on the Qwen2.5-0.5B model. In the LLM world, this is a relatively small model. A key question then is how the results shown in this paper scale with the model size. Would a 7B, 70B, or even larger model provide a correspondingly larger advantage or not? While running experiments with larger models is likely out of scope, some discussion/comments on the scaling behavior would be very helpful.
- A related question is the scaling behavior with the sample size. An important question is whether the LLM-based approach is universally advantageous for all dataset sizes, or only when the available dataset is small (and if so, where is the turning point). Again, running experiments with more data is likely beyond the scope due to resource limitations, but it would be nice if the authors could comment on this or refer to existing literature in other fields.
- p.15: In the “Training and reference networks” paragraph, it is mentioned that “In addition, we insert a copy of the test dataset into it.” Could you please clarify how it is done? Does it leak any truth information from the test dataset into the training?
- p.16, Figure 4: Why are the loss curves of the two reference networks shown as flat lines?
- Figure 4: From these plots, one would conclude that training the larger reference network from scratch gives much better results than adapting the pretrained LLM. Is that correct? If so, what is the benefit of adapting an LLM?
- Figure 4: Would finetuning the backbone further improve the performance?
- p.18, 2nd paragraph: Should it be 12,000 patches, or 1,200 (= (140 × 140) / (14 × 14) × 12)?
- Table 3: For the finetuning experiment, is the same learning rate applied to all layers, or are different learning rates used for the backbone and the connectors? (See the sketch below for how per-group learning rates are typically set up.)
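Regarding the last point, a minimal sketch of how separate learning rates for backbone and connectors can be set via optimizer parameter groups; the module sizes and learning-rate values are placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pretrained backbone and the connector networks.
backbone = nn.Linear(896, 896)
connectors = nn.Linear(64, 896)

# One optimizer, two parameter groups with separate learning rates.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},    # small LR for pretrained weights
        {"params": connectors.parameters(), "lr": 1e-3},  # larger LR for fresh connectors
    ],
    weight_decay=0.01,
)
```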
Recommendation
Ask for minor revision
Report
The idea is to re-purpose an LLM backbone for physics data in analogy to finetuning. The modality is changed from language to physics data, a sort of model reprogramming.
Data of 21cm background fluctuations and associated standard cosmological tasks are used as a testbed for this re-programmed LLM.
The first test is to completely freeze the backbone transformer, training only the connectors, and to compare against a network where the weights of the backbone transformer are re-initialized. Comparisons are also provided to a small network, illustrating the performance for a number of trainable parameters comparable to the L3M connectors, and to a large network, illustrating the ultimate performance of a dedicated network.
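For readers unfamiliar with this setup, freezing the backbone while training only the connectors usually amounts to disabling gradients on the backbone parameters; a minimal sketch with hypothetical stand-in modules:

```python
import torch.nn as nn

# Stand-ins for the transformer backbone and a connector (hypothetical sizes).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=896, nhead=14, batch_first=True),
    num_layers=2,
)
connector = nn.Linear(64, 896)

# Freeze the backbone: only the connector parameters receive gradient updates.
for p in backbone.parameters():
    p.requires_grad = False

trainable_params = [p for p in connector.parameters() if p.requires_grad]
```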
Improvements are seen for the pre-trained vs. the randomly initialized network, demonstrated as a lower loss in Fig. 4. However, the difference between random and pre-trained is comparatively small with respect to the two references [32k and 990k]. This indicates that there is still a lot of room for improvement [towards the 990k] and that the 32k model is too small to reach good enough performance (also corroborated by Fig. 5).
Interestingly, Fig. 6 does not show an appreciable difference between any of the three: pre-trained, random, 990k. The reader wonders whether there is a task and associated metric with higher sensitivity to the differences between models.
The second set of results allows for a fine-tuning of the backbone. Also, the downstream task changes.
The reader would, however, have liked to see the same task with and without fine-tuning.
Fig. 8 shows a similar trend as Fig. 4.
LoRA is presented as a promising technique to save compute by finetuning only a small fraction of the weights.
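As background for this point, a from-scratch sketch of the low-rank update LoRA applies to a frozen linear layer, y = W x + (α/r) B A x; the rank and scaling values here are illustrative and not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B A of rank r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

# Usage: wrap an existing projection, then train only A and B.
layer = LoRALinear(nn.Linear(896, 896), r=8)
```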
Requests:
Fig. 8: please align the legend with the three sub-figures for easier interpretation of the results.
Table 4: it is difficult to compare numbers that agree in the first 2-3 digits. The reader wonders whether ratios could be shown instead, or some other visual representation could be found.
Visual inspection of Fig. 9 is very difficult. It would be useful to have an associated performance metric to guide the reader.
Summary:
Overall, the paper is very original and certainly explores new ground. The results are promising, though it remains to be seen whether the marginal improvements lead to real, measurable physics gains in the trade-off between compute resources and performance.
Recommendation
Ask for minor revision
Report
Requested changes
1. Introduction: The claim “transformers have been shown to be the best-suited architecture for both representation learning and generation” lacks a reference.
2. Introduction: I recommend also citing JetClass II in the [40,41] citation bracket.
3. Introduction: The authors ask whether the extreme gap in scale between LLMs and typical physics networks can compensate for the data modality shift. I would like a bit more motivation behind this idea, for example a few references (if I understand ref. 44 correctly, it is not directly related to this specific question, since they do not use the LLM directly on numerical data). One would normally not expect anything in the structure of language to be applicable to physics data, given that the modalities are so wildly different.
4. Section 2: In order for this section to be useful to a reader that is not already familiar with all of the material, care needs to be taken to explain all elements introduced in the expressions and equations. Simple example: the theta in eq 3. It’s obvious to me what this theta refers to, but this might not be the case for the general audience (especially if this audience is expected to need all this detail to understand how the model works). Another example is eq 22 where the policy is introduced, but it’s not mentioned what policy means or what the significance of it is.
5. Section 2: Consider cutting or moving items that are not crucial for the understanding of the paper, in order to avoid confusing the reader. For example, the concept of RL does not appear in the main text, only the appendix.
6. Section 3: Towards the end of section 3.1, it would be good to either 1) mention what type of data you expect to input as the linguistic-coded token, as this is quite mysterious at this point, or 2) refer to the section in which this will be explained.
7. Section 3: Beginning of section 3.2: does SKA have a reference? I also believe the acronym is not spelled out anywhere in the paper, apart from inside a prompt box.
8. Section 3: A few selected visualizations of the data would be helpful here.
9. Section 4: Heteroskedastic loss does not belong to the most common loss functions. A reference here would be useful (a standard form is sketched after this list).
10. Section 4: Eq 36, motivate this choice or mention what it is called.
11. Section 4: Add a line explaining how eq 37 illustrates the point you want to make about unbiasing the covariance matrix.
12. Section 4: It is at this point still unclear to me how the “mixed” setup of section 3.1 (linguistic token and numerical token) works. This has to be explained at the latest at this point, as the text mentions tokens that the LLM already has pretrained embeddings for.
13. Section 4: It is unclear whether the idea to update the weights of the connector blocks separately is the authors’ own invention. If not, a reference or statement of the origin should be added. And similarly for the idea to duplicate the test dataset, original invention or reference/origin should be stated.
14. Section 4: Re-initialization of the scaling factors of the final layer norm – is this an empirical find, or something known in the literature (in which case a reference is needed)?
15. Section 4: Why do the two reference networks not have a causal attention mask? What motivates introducing this difference to the original Qwen architecture?
16. Section 4: The larger of the reference networks is said to “illustrate the ultimate performance of a dedicated network”, but it is then stated that the authors “do not care about its best possible performance” in the context of hyperparameter selection. The statements appear contradictory. Was any type of hyperparameter scan performed at all, in order to roughly get the “ultimate” performance out of this network?
17. Section 4: In general, the reasoning behind the choice of reference networks should be made clearer, with any caveats explicitly discussed. Qwen is several orders of magnitude larger than the reference networks, but must on the other hand adjust to the new domain.
18. Section 4: Fig 4, top row, center and right plots are missing one gridline.
19. Section 4: The observation in Fig 5, that increasing the depth of a randomly initialized backbone helps the performance, even though the backbone is not trained, is very interesting. I would like to have seen a discussion about this phenomenon, including relevant references.
20. Section 4: The method used to produce fig 6 is not entirely clear, please rephrase to make it easier to understand.
21. Section 5: The sequence length is kept reasonable by restricting to 12 consecutive spatial slices. How many slices do these lightcones normally have?
22. Section 5: It is not entirely clear to me where the CFM is plugged in, or where LoRa is applied. Perhaps a sketch of the complete architecture at this point would help the reader understand how the different parts fit together. This would also make it clearer what is actually being finetuned, as described on p 20.
23. Section 5: It seems the reference networks in this section are chosen in a different way compared to in section 4. Please explain the reasoning behind how the reference networks were chosen here.
24. Section 5: Fig 8 caption is confusing -- do all curves show the average of 5 finetunings, except 1 curve that shows the average of 7?
25. Section 6: “the pretrained LLM backbone outperforms the reference networks” – is this true also for the parameter estimation case? The end of section 4 seemed to suggest that the large reference network performed better.
26. Section 6: “For the regression, we also observe this behavior in the randomly-initialized network and identified a scaling with the number of transformer blocks in the network.” – I don’t see this mentioned in the main text.
27. References: Check capitalization in titles
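Regarding point 9 above, a sketch of the standard heteroskedastic (Gaussian negative log-likelihood) regression loss, for a network predicting a mean μ_θ and a per-sample uncertainty σ_θ; the notation is illustrative, up to additive constants, and not taken from the paper.

```latex
\mathcal{L}_\text{het} = \sum_i \left[ \frac{\big(y_i - \mu_\theta(x_i)\big)^2}{2\,\sigma_\theta^2(x_i)} + \log \sigma_\theta(x_i) \right]
```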
Recommendation
Ask for minor revision
