Back to our Regularly Scheduled Programming: an update on AI capabilities, corporate AI adoption and hyperscaler AI revenues vs spending
With some kind of tariff equilibrium possibly within reach, we return to some regularly scheduled programming: artificial intelligence and language models, which were the primary drivers of equity markets before the trade wars began. To be clear, our estimate of the US bilateral tariff rate on China would still be ~40% after incorporating all the individual pieces announced so far. But with some tariff clarity, markets may be able to refocus on other things. That's what companies are doing: they focused a lot more on AI than tariffs in Q1 earnings calls. This note covers massive hyperscaler AI spending, the improving capabilities of AI reasoning models, increased signs of corporate AI adoption and the scavenger hunt for growth in hyperscaler AI-related revenues.
Morning. Welcome back to the Eye on the Market podcast. This one is mid-May, and it's called Back to our Regularly Scheduled Programming. I've spent a lot of time this year on the intersection between politics, economics and markets, for good reason: there was a flurry of executive orders, memorandums and proclamations on tariffs, which was a catalyst for the first Sell America episode since 1982.
And by Sell America, I'm referring to a material and simultaneous decline in U.S. equities, the dollar and Treasury bonds, combined with U.S. equity underperformance versus the rest of the world. Like Blanche DuBois in A Streetcar Named Desire, the U.S. relies a lot on the kindness of strangers, and one of the first charts in this piece shows how reliant the U.S. now is on foreign versus domestic net savings.
The U.S. is almost entirely reliant on foreign net saving on this basis, so a Sell America episode is not a good one. But it looks like, for my 62nd birthday (I think that's right), Trump is going to set the China reciprocal tariff rate at 10%, like the rest of the other countries, in which case we've updated our estimated tariff rate on all U.S. imports.
It now looks like, if we assume this temporary negotiation holds, we're approaching an equilibrium state: a roughly 10% reciprocal tariff on a bunch of goods, with other goods exempted and still other goods subject to Section 232 product-specific tariffs. The big picture is that you're still looking at the largest tariff increase in 70 years or so.
But it's a lot lower, roughly half of what it was a month and a half ago. So now that we're approaching maybe some kind of steady state that countries and companies can adapt to, let's go back to our regularly scheduled programming, which is an update on AI, the primary driver of U.S. equity markets before all this trade war stuff began.
And even during all of this trade stuff, U.S. companies spent more time on Q1 earnings calls talking about AI adoption than they did about tariffs, which is interesting. Another thing to keep in mind is that the market capitalization of companies that benefit directly or indirectly from AI is two-and-a-half times larger than the market cap of the U.S. companies that would be the victims of tariffs.
So I think you could make the argument that AI is at least as important as tariffs to equity investors, if not more so. At the same time, the premium one would pay for AI plays relative to the stock market is back down to the level it last was in 2017. So it's probably a good time to be taking a look at this.
One of the things that happened during all the tariff stuff and the Sell America discussions was a lot of people writing about how U.S. equities are very expensive versus the rest of the world. If you just look at P/E multiples of the U.S. versus Europe or Japan, that's what you would probably conclude. But equities can be cheap or expensive relative to each other for good reasons.
And I always remind people: U.S. companies are a lot more profitable than their non-U.S. counterparts. We have a chart in here, shown on this page for people watching, that plots ROE against price-to-book; in other words, fundamentals versus valuations. There's a very linear relationship: the higher a sector's ROE, the higher its price-to-book ratio.
On that basis, U.S. equities don't look quite so mispriced relative to the rest of the world; here we're comparing them to the developed world ex-U.S. We have a number of different ways of running the chart that all tell you the same message. The top dot on here is of course the tech sector, which has the highest price-to-book ratio but also by far the highest projected return on equity. And as a sign of just how successful the tech and interactive media space has been, that sector now accounts for 35% of all the earnings in the market, compared to just 19% a decade ago.
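To make "very linear" concrete, here's a minimal sketch of the kind of cross-sector regression being described. The sector values below are purely hypothetical placeholders, not the chart's actual data:

```python
# Fit price-to-book against ROE across sectors, as in the chart described above.
# All numbers are hypothetical, for illustration only.
import numpy as np

roe = np.array([8, 10, 12, 15, 18, 25, 35])         # hypothetical sector ROEs, %
pb = np.array([1.2, 1.5, 1.9, 2.4, 3.0, 4.3, 6.1])  # hypothetical price-to-book ratios

slope, intercept = np.polyfit(roe, pb, 1)           # P/B ~= a + b * ROE
r = np.corrcoef(roe, pb)[0, 1]
print(f"P/B ~= {intercept:.2f} + {slope:.3f} * ROE (correlation r = {r:.2f})")
# A correlation near 1 is the sense in which higher-ROE sectors "deserve" higher
# price-to-book multiples, so a U.S. premium isn't automatically mispricing.
```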
And over the last couple of years, the primary driver in the tech space has been AI adoption, so that's what we want to focus on. Here we refreshed a chart that Stanford has been generating for several years on how AI capabilities are advancing. It looks at how AI models do versus humans on a number of different things: classifying images, visual reasoning, language understanding, math, science, things like that.
AI capabilities have now more or less matched or exceeded humans, and at the same time, costs have gone down a lot. Models are increasingly small but very powerful: inference costs for a system performing at roughly the level of GPT-3.5 dropped by almost 300 times between November 2022 and October 2024. Hardware costs are also declining, and energy efficiency is improving.
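As a quick back-of-envelope on what that decline implies per year, taking the ~300x figure and the November 2022 to October 2024 window at face value:

```python
# Annualize a ~300x inference cost decline over roughly 23 months (Nov 2022 - Oct 2024).
total_decline = 300
months = 23

annual_factor = total_decline ** (12 / months)
print(f"~{annual_factor:.0f}x cheaper per year")  # roughly 20x per year
```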
Open-weight models are closing the gap with closed-weight models. So there's a lot going on, and now that the tariff equilibrium appears to be set, I think it's time for us to refocus on these other things. What I want to do, and this is what's in the Eye on the Market this week, is quickly walk from the things that are most visible to the things that are least visible.
What's most visible is the increase in hyperscaler spending, whether in dollar terms or as a share of revenues. The way we track it, we look at capital spending plus R&D, and just for the big hyperscalers, it was $450 billion in 2024, expected to be something like 30% higher in 2025, which is kind of amazing.
So that's the most visible thing you can see. The next most visible thing is the improving capabilities of AI models on tasks and exams and things like that, on paper. Less visible than that is AI adoption by the corporate sector. And the hardest thing to find, and really the important thing, is the true pace of AI-related revenue growth at the hyperscalers.
So right now some of these things are a lot more visible than others. The hyperscalers continue to live by this mantra: we have more to lose by underspending than by overspending. Okay, great. But at least we can see some evidence of AI adoption and revenues associated with it. When I first started writing about language models in February 2023, there were a lot of questions about hallucinations, and about just how relevant language model scores on multiple-choice exams were when the models had been trained on the answers to those exams.
So all you were really getting was a sense of whether they had good memories; and yes, they have good memories. But progress has been made on a number of fronts. The models are now tested against much more advanced exams than simple multiple choice, and while you can't eliminate the contamination issue entirely in most cases, a lot of these models are doing much better on graduate-level science questions that require multi-step reasoning across physics, biology and chemistry.
And they're doing better on math questions that involve symbolic reasoning in algebra, combinatorics and number theory, not just pattern following and guessing the next word or the next number. So a lot has been done over the last two to two and a half years. Here's a chart, for example, on how the different language models are doing on a Google-proof Q&A test.
In other words, questions whose answers you can't find on Google, or at least not very easily. From mid-2023 to the fall of last year, the models were still languishing in the 30 to 50% range in terms of scores. With the advent of some of the reasoning models, those scores have gone up to roughly 70 to 90% across most of the models that you look at.
Similarly, reasoning models have really helped how these models do on math. The next chart is on a U.S. math Olympiad selection exam. Again, the models were languishing with really crappy scores (I don't know if that's a client- or compliance-approved word, but whatever) in 2023. Then late last year, around the same time the models started doing better on the Google-proof exams, they started doing better on some of these math exams,
with the advent of the reasoning models, whether it's Claude, Gemini or o3. Now those are just exams, right? And exams don't have a ton of practical use in the real world.
Now, to be clear, a lot of these models score well on these tests, exams and tasks, but only after the AI model builders torture their models to do well on them. So we also have to take a look at the hallucination rates that some of the new reasoning models are experiencing in the wild, when they're not working on certain preset exercises, and those rates are very high.
Hallucination right now is a big problem for reasoning models. One possible explanation is that some of them are recursive, sampling-based models: any single sample may have a single-digit hallucination rate, but if you keep sampling multiple times, you can end up with a very high cumulative hallucination rate. Some people think this is more of an engineering problem than a science problem.
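A minimal sketch of that compounding argument, assuming (illustratively) independent sampling steps with a single-digit hallucination rate each:

```python
# If each sampling/reasoning step independently hallucinates with probability p,
# the chance a multi-step answer contains at least one hallucination is 1-(1-p)^n.
# The per-step rate and step counts are illustrative assumptions, not measured values.
def cumulative_hallucination_rate(per_step_rate: float, steps: int) -> float:
    return 1 - (1 - per_step_rate) ** steps

for n in (1, 5, 13):
    print(f"{n:>2} steps: {cumulative_hallucination_rate(0.05, n):.0%}")
# 1 step: 5%, 5 steps: 23%, 13 steps: ~49% -- single-digit per-step rates can
# compound into the ~50% neighborhood shown in the table discussed next.
```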
There may be paths around it, but the bottom line is: look at this table. The hallucination rates of OpenAI's suite of reasoning models are roughly in the neighborhood of 50%. Sometimes a model provides broken links; sometimes it even describes steps in interim computations that it didn't actually perform. I mean, it tells the kinds of lies and falsehoods my four-year-old used to tell, and that's how easily some of them are identified.
So I think it's important to understand that some of the improved proficiency we're seeing on the prior charts and pages reflects things the models have been trained to do better at. Used on a more random, broad basis by the rest of us, by the diaspora of users, we're going to be much more subject to hallucination risks. This is Goodhart's Law in spades: once a benchmark becomes widely accepted, it tends to lose its value because people game and manipulate the outcome. The best example is from colonial India, where they had a problem with cobras, so they paid people to bring in cobras so they could kill them.
Then people started to breed cobras so they could bring them in and get paid, and they ended up with even more cobras. Anyway, it's very important to understand that hallucination risk for the reasoning models is still very high and still a problem that has to be solved. It also means that in corporate applications, those hallucination risks have to be bred out of whatever process the reasoning models are used for.
These are interesting things to look at, but what's more interesting to us is how the models do on coding. These models are now being tested on writing and editing code; the test here looks at their ability to execute over 200 tasks in multiple coding languages.
Some of them are still only getting a little more than half of these exercises correct, but others, like o4-mini and Google's Gemini 2.5 Pro, are doing much better, in the 70 to 80% range. And remember, there's something important about scores like these: if you're dealing with a system like air traffic control, self-driving cars or interpreting people's MRIs, mistakes are catastrophic,
and a model that scores less than perfect is a problem. But for most tasks that these things are being used for and might be used for, there's the ability to apply both human intervention and other models to come in and clean up mistakes. So when the consequences of mistakes are not catastrophic, I think model success scores of less than 100% can be perfectly viable and can still add to productivity.
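Here's a sketch of that cleanup logic, assuming (illustratively) independent review passes that each catch a fixed share of mistakes:

```python
# Residual error after k independent review passes (human or model), each of
# which catches catch_rate of the remaining mistakes. All rates are illustrative.
def residual_error(base_error: float, catch_rate: float, passes: int) -> float:
    return base_error * (1 - catch_rate) ** passes

base = 0.25  # e.g., a model that gets 75% of coding tasks right
for k in (0, 1, 2):
    print(f"{k} review passes: {residual_error(base, 0.8, k):.1%} residual error rate")
# 0 passes: 25.0%, 1 pass: 5.0%, 2 passes: 1.0% -- sub-100% models can be
# viable when mistakes are cheap to catch and fix.
```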
Here's another score, same idea: AI model coding competitions within OpenAI's own universe. When the reasoning models kicked in last year, the scores started to go up substantially. We get into all the details on this in the written piece. That's enough of that technical stuff. Actually, one more, because I think this is important too: how complex a task can these models try to tackle?
When we first started talking about these models a couple of years ago, the benchmark was just looking up the answer to a question in Wikipedia, whereas now we're asking them to write emails, create websites and analyze data sets. A recent paper in Nature magazine looked at how long these models can stay on track while working through some very complex, multi-step problems.
And these things have improved a lot; there's more information on that in the piece. Now, the models still struggle with certain real-world issues. There are other tests I think are more valuable, which look at things like: can they fix bugs or add features in GitHub repositories? So far, only Anthropic's Claude has scored more than 50% on this.
And there are other examples as well. There's something called Humanity's Last Exam, where most of the models still do quite poorly. I also asked GPT-4 to draw a map of Europe, and it did a hysterically, hilariously bad job, even after I asked it to try to fix it. It labeled the city of London as "Bland." Now, no argument there, but you know, that's a mistake; that one's in the appendix of the written piece. I think the most important thing is that, other than the coding exercises, none of these benchmarks really mean much to a chief technology officer thinking about enterprise adoption of AI, or about driving business impact through enterprise use cases.
So that's what we're going to look at next. Enough with all of these theoretical exams; what's going on in the real world? Well, I'm not a huge fan of McKinsey surveys; I think there are questions about rigor and thoroughness, things like that. My favorite study about consultants came out a few years ago, and it showed that when you hire consultants like McKinsey, the most likely outcome is that six or nine months later, you're still hiring consultants like McKinsey. Anyway.
That said, they did a survey of about 1,500 companies and asked them: over the next three years, how much do you think you're going to reduce employees? How much are you going to reduce overall costs? How much is your revenue going to go up? To simplify the survey, the good news is that around 50% of all respondents said they expect employee cost reductions and revenue increases from adoption of generative AI.
That's the good news. The bad news is that among the people who said it was going to help, the most frequent answer in almost each case was the smallest amount of help. In other words, the most frequent expected three-year reduction was 3 to 10% of employees, rather than 11 to 20% or more than 20%. Same story with revenue: the most frequent response was that it'll help by less than 5%. That said, it does show that AI adoption is increasing in real-world business cases. And we're getting the same story from a Bain survey that was just completed, which found that AI adoption cases have gone up by 50 to 60% over the last year.
The Census Bureau also does an interesting survey that looks at adoption rates by sector over time, and you're starting to see adoption rates of 20 to 30% in some of the sectors where you'd expect to see them. I thought an interesting anecdote, from one of our AI researchers, was a Mexican used-car platform that said it replaced its entire outbound sales team with an AI voice model powered by a generative AI platform. Most of the time, customers can't even tell they're speaking to a voice bot, and the bot does a better job converting customers than the human baseline it's compared to.
A couple of other things: there's been a sharp increase in FDA-approved medical devices that rely on AI, and I want to finish up with something about the FDA at the end.
And then there was an article in The Atlantic that I thought was interesting, proposing the idea that AI is starting to impact the job market. They show this chart as an example: for the better part of the last 30 years, the overall unemployment rate was higher than the unemployment rate for recent college graduates. Since 2020, which is before some of this AI stuff really kicked in, that gap has been reversing, so recent graduates now have higher unemployment rates than the overall market. And what do recent college graduates do? They summarize information, they aggregate data, and they create charts, tables and graphs. If that's what AI is getting better at, and if that's what AI is being used for, that would help explain this.
So this is indirect evidence at best. Now, last September, when I last did an AI update, I expressed a lot of concern that the hyperscalers were spending a ton of money and that we would pretty soon need to see hard evidence of them starting to get a return on all that investment. At the time, I cited an analysis by David Cahn at Sequoia, who backed into, using his own assumptions, how much the industry would need to earn every year given certain capital spending and gross margins at the hyperscalers. He got a number like $500 billion a year in incremental AI revenue. Now, he assumed a very rapid payback period, but even if you relax that constraint, you need some very big AI revenue figures for these companies, given that they're spending hundreds of billions of dollars on capex and R&D every year.
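For intuition, here's a minimal sketch of that payback arithmetic. The inputs are illustrative assumptions, not Cahn's actual ones:

```python
# To recoup capex within payback_years at a given gross margin on AI revenue,
# you need: annual revenue = capex / (gross_margin * payback_years).
def required_annual_revenue(capex: float, gross_margin: float, payback_years: float) -> float:
    return capex / (gross_margin * payback_years)

capex = 500e9        # hypothetical annual AI capex + R&D, in dollars
gross_margin = 0.50  # hypothetical gross margin on AI revenue
print(f"${required_annual_revenue(capex, gross_margin, 2):,.0f} per year")
# -> $500,000,000,000 with a two-year payback; relaxing the payback period
# lowers the requirement, but it stays in the hundreds of billions.
```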
So what are we seeing? Well, first of all, hyperscaler capital spending plus R&D as a share of revenue is starting to creep up again. From 2022 to 2024, for Microsoft, Google and Amazon, those numbers were kind of plateauing at around 25%; in other words, they were spending more, but they were also earning more. Now those numbers are starting to go up, and we're at 30 to 35%. We have to start watching this, because it seems like capital spending growth is overtaking the overall revenue growth of these businesses. Now, the good news is that Microsoft gave us some clues. As far as we can tell, they're the only one of the hyperscalers giving you hard data on how much they're earning from AI: on a trailing one-year basis, it looks like $3 to $3.5 billion, and it's obviously growing substantially in year-on-year terms.
And then another interesting observation: they said they processed 100 trillion tokens in Q1 2025, 50 trillion of them in March alone. Obviously that's a super jargony thing for them to say, but translated into English, and based on what we know about the way these models work, it means there's a lot of inference activity going on, and not just model training.
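The within-quarter acceleration is visible with simple arithmetic on those two disclosed figures:

```python
# Microsoft: 100 trillion tokens processed in Q1 2025, 50 trillion in March alone.
q1_tokens = 100e12
march_tokens = 50e12

jan_feb_monthly = (q1_tokens - march_tokens) / 2
print(f"Jan-Feb average: {jan_feb_monthly / 1e12:.0f}T tokens/month")  # 25T
print(f"March:           {march_tokens / 1e12:.0f}T tokens/month")     # 50T
# March ran at double the January-February pace: usage was accelerating
# within the quarter, not just large.
```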
Inference volume like that is typically a sign that corporations are adopting AI models and using them in actual workflows, so that was a good sign too. For the other hyperscalers, AI revenues are buried inside the cloud, and we're starting to see quarterly cloud revenues pick up, particularly for Amazon, Microsoft and Google, although they're kind of flat for Oracle and IBM. So we're seeing some evidence that AI revenues are picking up when we look at the cloud.
But at the end of the day, the most important chart is this one: how long can the hyperscalers keep this going? That's going to be a function of their overall free cash flow margins. Amazon's is still negative, but Meta, Microsoft and Google are still hanging in with 20 to 30% free cash flow margins. As long as that's the case, I think they'll be able to keep these capital spending wars going. But that's the thing we have to watch the most: if you start to see a sustained dip below the 20% level in free cash flow margins for Meta, Microsoft and Google, I think the markets would be very concerned that capital spending is getting way ahead of AI revenue generation.
So what happens next? Our people (you know, we have a lot of AI going on inside the company) tell me there's too much focus on software developers, who only spend around 30% of their time actually coding. So even if you improve their coding productivity by 50%, you're still only talking about a 15% overall productivity improvement.
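That's just the arithmetic of weighting the coding gain by the share of time spent coding:

```python
# Overall gain if only the coding share of a developer's time gets faster.
coding_share = 0.30  # share of time spent actually coding
coding_gain = 0.50   # productivity improvement within coding

overall_gain = coding_share * coding_gain
print(f"Overall productivity gain: {overall_gain:.0%}")  # 15%
# The other ~70% of the job (maintenance, testing, monitoring) is where the
# next paragraph argues the bigger savings sit.
```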
And they think the larger gains from generative AI, rather than coding per se, are in things like software maintenance, unit testing, integration testing and performance monitoring. These are harder things to measure, but they think the savings potential is much greater behind the scenes. I think Microsoft and Amazon are likely to accelerate efforts to build their own foundational models; there's a lot going on that you can read about each week in the press on the ongoing divorce between OpenAI and Microsoft.
Amazon, Google and Microsoft are also trying to manufacture their own GPU-like chips to break Nvidia's stranglehold on the market. And there are a bunch of AI adoption milestones to watch for over the next couple of years: self-driving cars and drones, multimodal AI used in entertainment, personalized AI systems and things like that.
But at the end of the day, looking back to the 1990s, this is the biggest capital spending experiment on record by the tech sector. We're now setting consistent new highs in capital spending and R&D as a share of revenues. So the bottom line is: this thing had better work, and the adoption rates are going to have to keep going up pretty soon.
I did mention the AI- and machine learning-enabled medical devices that have been approved by the FDA.
Speaking of the FDA, I don't know if I should be drinking so many Frescas; I have no idea if that's healthy or not. But I did want to mention one thing about the FDA: drug approval rates have fallen in half, at least in Q1 of this year. And I was thinking about that recently when I found out that people immunized for measles, mumps and rubella between 1963 and 1967 received an inactivated version of the vaccine. It was just that brief period, because the live vaccine wasn't approved until 1967. The problem is that the inactivated version of the vaccine gives you much lower immunity than the live one, and some studies show that only a quarter of the people who got it still had detectable antibodies at some point later.
So the CDC is recommending that individuals vaccinated during that period get the new live vaccine. But with certain medical conditions, like one I happen to have, you can't get those vaccines, because they contain live attenuated viruses, which aren't good for people with certain kinds of immune deficiencies. And so, you know, that's that. Now people like me and others are being negatively impacted by all the people deciding they don't want to get vaccinated for MMR anymore, even though it's 97% effective against the spread of measles. Just to give you a sense of infectiousness: the infectiousness measure for COVID and the flu is something like one to two; for polio and smallpox it's five to seven; and for measles it's 12 to 18. And what's going on in the country is that over the last 10 years, nationwide vaccination rates have fallen from 95% to 92%.
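Those infectiousness figures map directly to herd immunity thresholds via the standard formula 1 - 1/R0, using the R0 values just cited; it's why coverage slipping below the mid-90s matters so much for measles in particular:

```python
# Share of the population that needs immunity to stop sustained spread: 1 - 1/R0.
def herd_immunity_threshold(r0: float) -> float:
    return 1 - 1 / r0

for disease, r0 in [("flu/COVID (low end)", 2), ("polio/smallpox", 6), ("measles", 15)]:
    print(f"{disease:<20} R0 ~{r0:>2}: ~{herd_immunity_threshold(r0):.0%} immunity needed")
# Measles at an R0 of 12-18 implies roughly 92-94% -- right around where
# national coverage now sits, and well above coverage in some states and counties.
```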
But there's a whole bunch of states below 90%: Georgia, Colorado, Wisconsin and Alaska, and Idaho has fallen to 80%. In Gaines County, Texas, where a lot of the measles cases have occurred, vaccination rates are 80%, and one school district is below 50%. And there was a recent study from Stanford estimating that measles could become endemic again within two decades.
Given these declines in vaccination, instead of consistently messaging the importance of the vaccine, RFK Jr. has directed health agencies to explore potential new treatments, including vitamins and cod liver oil, for people who get measles. I think it's a good time for me to stop this podcast right there. Thank you for listening.
And we’ll see you again next time. Bye.
About Eye on the Market
Since 2005, Michael has been the author of Eye on the Market, covering a wide range of topics across the markets, investments, economics, politics, energy, municipal finance and more.