10 Comments
Thomas Larsen

Thanks for the post!

I agree with a lot of what you're saying at a high level -- in fact my median timeline to Superhuman Coder from AI 2027 is more like 2031.

I disagree that data is likely to be the key bottleneck, and am more sold on compute and algorithms as the main bottlenecks. Thought experiment: suppose the number of high quality internet tokens was 10x or 100x smaller. Would timelines lengthen a lot because we have even more of a data bottleneck? I don't think so.

A few specific comments:

> The Problem with Extrapolation

I think there are two arguments you might be making here.

1. My first interpretation was: To get AGI, you need to have broad capabilities. The time horizon trends depend on domain, and so naively extrapolating them isn't a reliable way to forecast AGI.

I think that's incorrect, because our model is that AI research gets automated first, and then all of the other tasks fall (e.g. vision, driving, math, persuasion, etc.). So we only care about the time horizon and capability level w.r.t. coding/AI research, not the others.

2. Another thing you might be saying is: To get to superhuman coder to automate research, you need to be good at all sorts of domains, not just coding.

But I also disagree with this -- I think there is only a pretty small set of domains we care about. For example, the Epoch post cites chess, iirc. But I don't think that chess time horizon will be a limiting factor for automating coding. I think that reasoning, coding, creativity, etc. will be necessary.

> Moore’s Law & Paradigms

I'd be quite surprised if "RL with verifiable rewards doesn’t elicit new capabilities". Huge if true. But idk, it's confusing what this even means? For example, OAI and DeepSeek training shows huge performance gains from RL. I haven't read the paper yet, so might be misunderstanding.

I like the overlap of the two sigmoids; I think it's a helpful illustration that this trend is much, much less of a steady curve than Moore's law. I don't think we really have enough data to be confident about the extrapolation.

> Relevant Data is Key → Workflow Data is Key

Yeah, this is a classic crux. The million dollar question re: timelines is "to what extent do we need long horizon training samples". My guess is that the bottleneck is less data and more compute -- see the classic BioAnchors debates between Daniel and Ajeya. The short timelines view basically has to rely on huge amounts of generalization. Why might this be a reasonable view?

One intuition driving the short timelines view for me is that long horizon tasks can be factored into short horizon tasks: you can break a longer time-horizon task down into figuring out the first step, then executing it, then reorienting and thinking about what the next step is.
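A toy sketch of that factoring intuition (the step functions here are hypothetical stand-ins for short-horizon model calls, not any particular agent framework):

```python
# Toy sketch of factoring a long-horizon task into short-horizon steps.
# `propose_next_step`, `execute`, and `is_done` are hypothetical stand-ins
# for short-horizon model calls, not a real agent framework.

def run_long_task(goal, propose_next_step, execute, is_done, max_steps=100):
    history = []  # record of (step, outcome) pairs so far
    for _ in range(max_steps):
        if is_done(goal, history):
            break
        # Short-horizon call: decide the next step given the goal + history.
        step = propose_next_step(goal, history)
        # Short-horizon call: carry out just that step.
        outcome = execute(step)
        # "Reorient": fold the result back in before deciding again.
        history.append((step, outcome))
    return history
```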

Michelle Ma

Thanks for the response! You bring up good points, and looking at it again, I think there are some places where my reasoning is unclear / sparse.

>> The Problem with Extrapolation

My argument is not quite about different domains (like vision, math, etc.)—I’m more claiming that increasing a model’s time horizon (on just coding/AI research) entails advancing various capabilities that could be ~as distinct as different domains.

Also, “time horizon” conceptually bundles together many skills that are strongly correlated in humans (if you’re good at some, you’re probably good at all of them), but not necessarily in AI systems—Moravec’s Paradox, in a sense.

To clarify the example from the post:

The reason a young child can correctly fill in the blank for “The sky is ___”, but (probably) can’t implement binary search, is that he has knowledge and understanding of the former but not the latter. To become more capable, he has to wait for his brain to develop enough to grasp complex abstract concepts (like computer algorithms), and he also needs to actually learn the material. This is roughly similar to why GPT-2 can also answer “blue” but can’t write simple code: a 2-second task vs. a 5-minute task (as METR would categorize them). In this respect, GPT-4’s improvement to the 5-minute coding time horizon came from scaling model size, training duration, and training data—roughly analogous to a child both maturing cognitively and learning domain knowledge.

However, even if a teenager, for example, is very good at coding up individual algorithms (shorter task), this does not mean that he is capable of managing a coding project (longer task), especially if he has exclusively been taught on the former. Project management seems to entail various executive function skills that are learned and practiced separately from pure CS knowledge. There are also tacit knowledge aspects of handling coding projects that are typically learned through experience. The deficits and solutions for the teenager seem different from those of the child. Similarly, o1’s increased time horizon came largely from RL on CoT prompting (~executive function) and RLVF on coding tasks (~domain-specific practice), rather than scaling.

The point of this example is basically that human & AI time horizons alike consist of distinct capacities, so it's misleading to portray them as a homogeneous quantity that can be laid along the y-axis & extrapolated from. It's like plotting "skill" on the y-axis, where values 0-5 are chess skill, 5-10 are math skill, 10-15 are robotics skill, etc. (perhaps less extreme).

>> Time Horizon & Input Length

Where Moravec’s Paradox might kick in is when you get to months-long time horizons. At that point, I argue that the important skill becomes the ability to reason over long/many inputs. Or if not reasoning, *retrieving*.

When I wrote this post, for example, it was the culmination of a couple months of interspersed reading & thinking, and I drew upon dozens of AI-related papers, posts/articles, courses, conversations, as well as general knowledge about writing technique & organization, etc. Human working memory is tiny, so obviously I wasn’t thinking about everything at once, but instead retrieving & reasoning over a few pieces of information for each micro-task. And these facts were the *very* specific & narrow pieces of the overall corpus particularly relevant to the task at hand.

Models are good at pinpointing this important information in shorter contexts, but sophisticated reasoning consistently degrades over even <20k tokens. This is likely due to attention dilution, which is an inherent constraint on long context windows. It suggests that a retrieval mechanism (or something similar) is necessary to handle longer inputs (to break the input down, store it, & retrieve from it like humans do). However, while humans seem to have some innate capacity for this kind of highly dynamic and precise retrieval and acquire the skill implicitly, models don’t seem to. Even having multiple needles rather than a single needle in needle-in-haystack-style tests significantly degrades performance (even with RAG), suggesting a challenge that doesn’t reflect human intuitions about difficulty & skill.
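As a rough illustration, a multi-needle version of that test might look like the sketch below, where `ask_model` is a hypothetical placeholder for whichever model or RAG pipeline is being probed:

```python
import random

# Sketch of a multi-needle needle-in-a-haystack check.
# `ask_model(context, question)` is a hypothetical placeholder for the
# model or RAG pipeline under test; the needles/filler are illustrative.

def multi_needle_trial(ask_model, needles, filler_sentences, seed=0):
    rng = random.Random(seed)
    haystack = list(filler_sentences)
    # Scatter every needle fact at a random position in the filler text.
    for fact in needles.values():
        haystack.insert(rng.randrange(len(haystack) + 1), fact)
    context = " ".join(haystack)
    # Success requires retrieving *all* needles, which is where performance
    # tends to degrade relative to the single-needle version of the test.
    return {question: ask_model(context, question) for question in needles}
```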

>> Generalization & breaking down long tasks

This is precisely the difficulty of “reorienting”, which might be much simpler for us than for AI. It is a long horizon task not in the sense that you necessarily need to reason over long/many inputs, but that you need to at least retrieve over them. You need to identify the specifically relevant starting parameters and end goals, the intermediate steps leading to the current state, past pitfalls to avoid, and immediately action-relevant facts, while ignoring currently irrelevant information (i.e. most of it, in long tasks), as well as recognize when all existing retrievable inputs are inadequate/irrelevant and further searching is required.

Models are currently bad at this, and it’s such a complex & nuanced skill that it’s hard to imagine it will emerge given very little relevant data (long horizon workflows). Going back to the analogy used in the post, it seems closer to training a model almost exclusively on 3-token fragments and expecting it to learn grammar rather than merely having 10-100x less high quality data. Relatedly, as far as I can tell, some otherwise intelligent people are not so great at complex/long-term projects because they're not sufficiently skilled at this sort of process.

I don’t claim that models can’t learn human-level retrieval, but in the same way that the transformer algorithm relies on extensive training data to produce a capable model, it seems like retrieval algorithms would have similar data requirements to achieve adequate capabilities.

>> Compute bottlenecks

I agree that compute seems like a bottleneck as well, but I’d argue that workflow data is even scarcer relative to its necessity. Moreover, this isn’t specified in the post, but this data bottleneck seems like it could not only hamper timelines, but takeoff as well. If it’s the case that retrieval processes differ significantly between a months-long project and a years-long career, then generalization could become an enduring issue, rendering algorithms & compute insufficient for creating a superintelligent agent, even with a superhuman coder helping. At this point, collecting the relevant data seems substantially more time-consuming & difficult, perhaps precluding hard takeoff.

But maybe I’m missing something from the biological anchors debate—let me know.

--

Note: On RLVF: “new capabilities” might be confusing/misleading here, but the paper basically found that RLVF improved performance by increasing efficiency, not by raising the upper bound for capabilities. As in, the RL models performed much better under pass@1, but when both RL models & base models were permitted many attempts (e.g. pass@128), base models consistently outperformed the RL models (more likely to pass within the 128 trials). The authors suggest that this might be because RL reduces response diversity. But then again the paper only involved open-source models, so perhaps this isn’t the case with frontier models.
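For reference, the pass@1 vs. pass@128 comparison can be reproduced in spirit with the standard unbiased pass@k estimator; the per-problem counts below are made up purely to illustrate how an RL model can win at pass@1 yet lose at pass@128 when its samples are less diverse:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k samples,
    drawn from n generations of which c were correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up per-problem counts (n=256 samples each): the RL model is right
# more often on problems it can solve, but never solves problem B, while
# the less-accurate base model occasionally does.
base = [50, 2]   # correct samples per problem, base model
rl   = [200, 0]  # correct samples per problem, RL model

for k in (1, 128):
    base_score = sum(pass_at_k(256, c, k) for c in base) / len(base)
    rl_score   = sum(pass_at_k(256, c, k) for c in rl) / len(rl)
    print(f"pass@{k}: base={base_score:.2f}  rl={rl_score:.2f}")
```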

Harjas Sandhu

This is an excellent post. To add to the argument,

> A fifth of new businesses fail within the first year, and the majority fail within a decade—unwittingly training on these processes seems undesirable, especially if you’re interested in creating a highly capable agent.

Even worse, it’s possible that some of these businesses were executed well and just got unlucky, whereas other successful businesses might have used bad processes and gotten carried by other structural factors like lobbying or capital.

This is also assuming that there are in fact generalizable lessons to take from business processes. I think that’s probably a true statement, but your opinion depends on your views about business successes and failures. How do you teach an AI to stop “resulting” when its entire training paradigm is probabilistic?

Michelle Ma

Thanks!

Tbh, I might've misspecified this point, because really the key thing we want the AI to learn is to not forget things or randomly change course on the fifth repetition of a task, along with other simple human abilities probably enshrined in any business process. Making an AI that can independently run a business that fails within a few years, or that succeeds 'merely' because it's carried by lobbying/capital, is still probably game over for many white-collar workers. But if the intelligence explosion depends on a top-percentile superhuman coder, that seems perhaps different; it might've been a more suitable example. Then again, this might also be exaggerated because once you have baseline ability (e.g. grammar, long-context reasoning), adding on high-level special skills (e.g. competition math problems, coding) seems to require much, much less data for post-training.

Also, probabilistic =/= resulting, I think? Acting probabilistically is often optimal (in EV terms).

Matt Reardon

Am I right to think that the overfitting problem is doing a lot of the work in the background here? Big models with comparatively little or low quality data start missing important generalizations because they spent all their FLOP memorizing their poor/small data when there was not enough relevant data to generalize over?

Michelle Ma

Hmm I think it depends on which situation you're referring to?

For context, overfitting/underfitting generally has 2 possible causes:

- The model is too complex / too simple

- The training data is not representative of real-world data due to undersampling*

W.r.t. current models' long context capabilities, it's less an over/underfit issue & more that the model can't find the 'fit' at all because of attention dilution. It's like the difference between mistakenly thinking that because some details are salient in one case they must be important in all cases, vs. not even understanding which details are important within the context of a single case.

W.r.t. current algorithmic alternatives for enabling long context reasoning, it's actually closer to underfitting because e.g. semantic similarity measures used in retrieval algorithms are too simple to enable sophisticated reasoning.

W.r.t. training more dynamic & precise retrieval / compression / etc. algorithms -- if data is lacking and/or poor then yes the model will fail to generalize, but it could either overfit or underfit, depending on the nature of the training vs. validation data. Overfitting might be more likely though.

* Note that this is generally distinguished from cases where the training data is not representative because it is sampled from the wrong distribution (as opposed to undersampled from the right distribution), although the boundary between these two cases is sometimes poorly defined

Matt Reardon

I'm out of my technical depth, but I guess I was imagining a case where you used e.g. *only* 1,000 examples of one-hour computer tasks to train a giant model (10^27 or something crazy); that model would overfit like crazy and not be useful.

I guess the more realistic claim is that you have a general LLM pretrained on 40t tokens *which include* 1,000 hour-long task examples and it basically can't even pick out the examples and connect them to other concepts in its training corpus strongly enough to generalize at all. Tasks just seem like noise, so it reaches for some more common concept that isn't anything like one of the 1,000 anomalous-looking task examples. Is that attention dilution?

Michelle Ma

Yea, in the first case it would definitely overfit, probably to the extent that its outputs would be memorized gibberish, since the disparity is so large.

In the second case, there are two things happening.

1) During training: The long task data is a tiny fraction of the total training data, so the model's overall learning is completely dominated by the other data & yea it basically can't pick out/connect the examples. This is just a data sparsity problem, separate from attention dilution.

2) During inference: The model basically has a limited amount of attention it can spread across the input sequence, so for longer inputs, most of it just seems like noise. The resulting output is very simplistic and arbitrary. This is attention dilution.
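A toy numerical sketch of that dilution, with random scores standing in for real query-key products (so purely illustrative, not measured from any model):

```python
import numpy as np

# Toy illustration of attention dilution: with softmax attention, the
# weight available to any one key shrinks as the context grows, unless
# the relevant key's score stands out by an ever-larger margin.

rng = np.random.default_rng(0)
for seq_len in (1_000, 10_000, 100_000):
    scores = rng.normal(size=seq_len)   # query-key scores for one query
    scores[0] += 3.0                    # one "relevant" token gets a boost
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(f"len={seq_len:>7}: weight on relevant token = {weights[0]:.5f}")
```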

Matt Reardon

Ah I think I was missing something very basic here and trying to make this about pretraining in a very simplistic sense.

The issue is: in computer use, you have these continually compounding inputs. The initial prompt might only be two sentences of text, but agentically searching through multiple pages filled with buttons, information, implicit options, and problems, while remembering which things worked and didn’t (and for which reasons) as you go forward and search ahead, is a lot more ~cognitive load than writing up the paragraph of text that most naturally follows the previous one or filling in an outline with plausibly relevant details. Failing to give sufficient attention to a single button somewhere in all of this can cause the whole task to fail.

Michelle Ma

Yep! Also see my response to Thomas's comment.
