Back to our Regularly Scheduled Programming: an update on AI capabilities, corporate AI adoption and hyperscaler AI revenues vs spending
With some kind of tariff equilibrium possibly within reach, we return to some regularly scheduled programming: artificial intelligence and language models which were the primary drivers of equity markets before the trade wars began. To be clear, our estimate of the US bilateral tariff rate on China would still be ~40% after incorporating all the individual pieces announced so far. But with some tariff clarity, markets may be able to refocus on other things. That’s what companies are doing: they focused a lot more on AI than tariffs in Q1 earnings calls. This note covers massive hyperscaler AI spending, the improving capabilities of AI reasoning models, increased signs of corporate AI adoption and the scavenger hunt for growth in hyperscaler AI related revenues.
(DESCRIPTION)
Logo: J.P. Morgan. Eye on the Market.
Slide: May 2025. Title: Back to our Regularly Scheduled Programming. An update on AI capabilities, corporate AI adoption, and hyperscaler AI revenues versus spending.
(SPEECH)
Morning. Welcome back to the Eye on the Market podcast. This one is mid-May, and it's called Back to Our Regularly Scheduled Programming: an update on AI.
(DESCRIPTION)
Slide: Blanche DuBois. Line graph titled US net savings. Y-axis: Percent of GDP: negative 4% to positive 14% in increments of 2%. X-axis: 1955 to 2025 in increments of 10 years. Three color-coded lines: Foreigners' net savings in the US; US national net savings; and US net savings (national + foreign). Source: FRB, BEA, JPMAM. September 2024.
(SPEECH)
I've spent a lot of time this year on the intersection between politics, economics and markets, for good reason: there was a flurry of executive orders, memorandums and proclamations on tariffs, which was a catalyst for the first Sell America episode since 1982.
And by Sell America, I'm referring to a material and simultaneous decline in US equities, the dollar or Treasury bonds, combined with US equity underperformance versus the rest of the world. Like Blanche DuBois in A Streetcar Named Desire, the US relies a lot on the kindness of strangers, and one of the first charts in this piece shows how much the US is now reliant on foreign versus domestic net savings.
So the US, on this basis anyway, is almost entirely reliant on foreign net savings, so a Sell America episode is not a good one. But it looks like, for my 62nd birthday, I think that's right, Trump is going to set the China reciprocal tariff rate at 10%, like the rest of the other countries, in which case we've updated our tariff rate on all US imports.
(DESCRIPTION)
Slide: Tariff update, assuming 10% reciprocal rate on everyone. Line graph titled Average tariff rate on all US imports. Y-axis: 0% to 30% in increments of 5%. X-axis: 1900 to 2025 in increments of 25 years. One blue line. Two red notes: +25% expected section-specific tariffs on semiconductors, pharma, copper, lumber; +10% reciprocal tariff. Two blue notes: +25% on global autos; +20% on China, 25% on Mexico and Canada non-USMCA, 25% steel & aluminum. Source: Tax Foundation, JPM Global Economics, GS Global Investment Research, JPMAM. May 12, 2025.
(SPEECH)
And it now looks like, if we assume this temporary negotiation holds, that we're approaching an equilibrium state, which is a roughly 10% reciprocal tariff on a bunch of goods; other goods get exempted, and then other goods get subject to Section 232 product-specific tariffs. The big picture is that you're still looking at the largest tariff increase in 70 years or so, but it's a lot lower, roughly half of what it was, let's say, a month and a half ago.
So now that that's happening and we're approaching maybe some kind of steady state that countries and companies can adapt to, let's go back to some regularly scheduled programming, which is an update on AI, which was the primary driver of US equity markets before all this trade war stuff began. And even during all of this trade stuff going on, US companies spent more time on Q1 earnings calls talking about AI adoption than they did about tariffs, which is interesting.
And another thing to keep in mind is that the market capitalization of companies that benefit directly or indirectly from AI is two and a half times larger than the market cap of the US companies that would be the victims of tariffs. So I think you could make the argument that AI is at least as important as tariffs to equity investors, if not more so. At the same time, the premium one would pay for AI plays relative to the stock market is back down to the level it was last at in 2017.
So it's probably a good time to be taking a look at this.
(DESCRIPTION)
Slide: US versus developed world profitability. Dot plot titled S&P 500 versus developed world ex US. Price to book ratio. Y-axis: 0 to 12 in increments of 2. X-axis: 0% to 35% in increments of 5%. Comparison of S&P 500 and MSCI World ex US in 10 categories: consumer discretionary; consumer staples; energy; financials; health care; industrials; information technology; materials; communication services; and utilities. Source: Bloomberg, Empirical Research, JP MAM. May 12, 2025.
(SPEECH)
So one of the things that happened during all the tariff stuff and the Sell America discussions was a lot of people were writing about how US equities are very, very expensive versus the rest of the world. And if you just look simply at PE multiples of the US versus Europe or Japan, that's what you would probably conclude. But equities can be cheap or expensive relative to each other for certain reasons. And I always remind people, US companies are a lot more profitable than their non-US counterparts. And so if you look at that, we have a chart in here, shown on this page for people watching.
If you plot ROE against price-to-book, for example (in other words, fundamentals versus valuations), there's a very linear relationship: the higher the sector ROE, the higher the price-to-book ratio. And so on that basis, US equities don't look quite so mispriced relative to the rest of the world. Here we're comparing to the developed world ex-US. We have a number of different ways of running the chart that will all tell you the same message.
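To make that concrete, here's a minimal sketch of the kind of cross-sectional fit behind the chart. The sector numbers below are hypothetical placeholders, not the actual S&P 500 or MSCI World ex-US figures; the point is just that once you regress price-to-book on ROE, cheap or expensive is measured as distance from the fitted line rather than by the raw multiple.

```python
import numpy as np

# Hypothetical (ROE, price-to-book) pairs for a handful of sectors;
# placeholders only, not the actual S&P 500 / MSCI ex-US data.
roe = np.array([0.08, 0.12, 0.15, 0.20, 0.30])  # return on equity
pb = np.array([1.2, 2.0, 2.6, 3.8, 7.5])        # price-to-book ratio

# Fit P/B = a + b * ROE across sectors (np.polyfit returns slope first).
b, a = np.polyfit(roe, pb, 1)

# A sector's valuation "gap" is its residual from the fitted line:
# positive = richer than its profitability alone would suggest.
residuals = pb - (a + b * roe)
print(f"slope = {b:.1f}, intercept = {a:.2f}")
print("residuals:", np.round(residuals, 2))
```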
Now, the top dot on here is, of course, the tech sector which has the highest price to book ratio, but also by far the highest projected return on equity. And as a sign of just how successful the tech and interactive media space has been, that sector now accounts for 35% of all the earnings on the market compared to just 19% a decade ago. And over the last couple of years, the primary driver in the tech space has been AI adoption. So that's what we want to focus on.
(DESCRIPTION)
Slide: AI capabilities. Line graph titled AI versus human performance. Performance relative to human baseline, percent. Y-axis: 0% to 120% in increments of 20%. X-axis: 2012 to 2024 in increments of two years. Color-coded comparisons of eight learning categories measured against human performance: Image classification. Medium-level reading comprehension. Visual reasoning. English language understanding. Multitask language understanding. Competition-level mathematics. PhD-level science questions. Multimodal understanding and reasoning. Source: Stanford Human-Centered AI, JP MAM. April 2025.
(SPEECH)
So here we refreshed a chart that Stanford produces on how AI capabilities are advancing. They've been generating this chart for several years, and it looks at how AI models do versus humans on a number of different things related to classifying images, visual reasoning, language understanding, math, science, things like that.
And AI capabilities have now more or less matched or exceeded humans. At the same time, costs have gone down a lot, and models are increasingly small but very powerful: inference costs for a system performing, let's say, at the level of GPT-3.5 dropped by almost 300 times between November 2022 and October 2024.
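As a back-of-the-envelope check on what that decline implies, here's a quick sketch; the 300x figure and the rough 23-month window are from the text, and the monthly rate is just the implied geometric average.

```python
# From the text: inference costs fell ~300x between November 2022
# and October 2024, roughly 23 months.
total_decline = 300
months = 23

# Implied constant monthly cost multiplier: (1/300) ** (1/23)
monthly_factor = (1 / total_decline) ** (1 / months)
print(f"implied monthly cost multiplier: {monthly_factor:.2f}")              # ~0.78
print(f"i.e., roughly {100 * (1 - monthly_factor):.0f}% cheaper per month")  # ~22%
```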
And also, hardware costs are declining, energy efficiency is improving, and open-weight models are closing the gap with closed-weight models. So there's a lot of things going on, and now that the tariff equilibrium appears to be set, I think it's time for us to refocus on these other things.
(DESCRIPTION)
Slide: AI update.
(SPEECH)
So what I want to do is quickly walk through, and this is what's in the Eye on the Market this week, the things that are most visible down to the things that are least visible.
What's most visible is the increase in hyperscaler spending, whether in dollar terms or as a share of revenues. The way we track it, we look at capital spending plus R&D, and just for the four big hyperscalers it was $450 billion in 2024, expected to be substantially higher, like 30% higher, in 2025, which is amazing.
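For concreteness, the arithmetic behind that spending trajectory; the $450 billion level and the roughly 30% growth are from the text, while the revenue figure used for the share-of-revenue version of the metric is a labeled placeholder, not a reported number.

```python
# Hyperscaler spending as tracked in the piece: capital spending plus R&D.
capex_plus_rd_2024 = 450e9  # from the text: ~$450B across the four big hyperscalers
growth_2025 = 0.30          # from the text: expected ~30% higher in 2025

capex_plus_rd_2025 = capex_plus_rd_2024 * (1 + growth_2025)
print(f"implied 2025 spend: ${capex_plus_rd_2025 / 1e9:.0f}B")  # ~$585B

# Share-of-revenue version of the same metric (revenue is a placeholder):
hypothetical_combined_revenue = 1.5e12
print(f"spend as share of revenue: {capex_plus_rd_2024 / hypothetical_combined_revenue:.0%}")
```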
So that's the most visible thing you can see. The next most visible thing is the improving capabilities of AI models on tasks and exams and things like that, on paper. The next thing, which is not as visible, is AI adoption by the corporate sector. And then the hardest thing to find is the true pace of AI-related revenue growth of the hyperscalers, which is really an important thing.
So right now, some of these things are a lot more visible than others. The hyperscalers continue to live by this mantra: we have more to lose by underspending than overspending. OK, great. But at least we can see some evidence of AI adoption and revenues associated with it. So when I first started writing about language models in February 2023, there were a lot of questions about hallucinations, and just how relevant the language models' scores on multiple choice exams were when the models were trained on the answers to those exams.
So all you were really getting was a sense of, are they good memorizers? And yes, they are good memorizers. But progress has been made on a number of fronts. The models are now tested against way more advanced exams than simple multiple choice. And while you can't eliminate the contamination issue entirely in most cases, a lot of these models are doing much better on graduate level science questions that require multi-step reasoning across physics, biology, and chemistry. And they're doing better on math questions that involve symbolic reasoning, in algebra and combinatorics and number theory, and not just pattern following and guessing the next word or guessing the next number.
So a lot has been done over, let's say, the last two to two and a half years.
(DESCRIPTION)
Slide: Empirical signs of AI progress. Dot plot titled PhD-level science questions. Google Proof Q&A Diamond accuracy, percent. Y-axis: 20% to 90% in increments of 10%. X-axis: Model release date. June 2023 to March 2025 in increments of three months. Color-coded measurements of six platforms against expert human and random guessing: Google, Anthropic, DeepSeek, OpenAI, x AI, and Meta AI. Source: Epoch AI, JP MAM. May 5, 2025.
(SPEECH)
And here's a chart, for example, on how the different language models are doing on this Google-proof Q&A test: in other words, things you can't find the answers to on Google, or at least not very easily. From mid-2023 to, let's say, the fall of last year, the models were still languishing in the 30% to 50% range in terms of scores. With the advent of some of the reasoning models, those scores have gone up to, let's say, 70% to 90% across most of the models that you look at.
(DESCRIPTION)
Slide: Empirical signs of AI progress. US Math Olympiad Selection Exam. Mock American Invitational Mathematics Examination Score. Color-coded dot plot using the same six platforms. Y-axis: 0% to 100% in increments of 10%. X-axis: Model release date. June 2023 to March 2025 in increments of three months. Color-coded measurements of five platforms: OpenAI, Google, Anthropic, DeepSeek, and Meta AI. Source: AIA Labs. April 2025
(SPEECH)
Similarly, reasoning models have really helped how these models do on math. So the next chart is one on this US Math Olympiad selection exam. Again, the models were languishing with really crappy scores (I don't know if that's a compliance-approved word, but whatever) in 2023. Then late last year, in the fall, around the same time that the models started doing better on the Google-proof exams, they started doing better on some of these math exams, with the advent of the reasoning models, whether it's Claude or Gemini or o3 or things like that.
(DESCRIPTION)
Slide: AI model coding ability. Dot plot titled Code writing & editing. Aider polyglot benchmark, percent of exercises correct. Y-axis: 50% to 85% in increments of 5%. X-axis: Cost to run benchmark, US dollars. $0 to $200 in increments of $50. Color-coded comparisons of six platforms: OpenAI, Google, Anthropic, DS + Anthropic, DeepSeek, and x AI. Source: Gauthier (Aider), JP MAM. May 5, 2025.
(SPEECH)
Now, those are just exams, and exams don't have a ton of practical use in the real world. They're interesting things to look at. What's more interesting to us is how they do on coding, and these models are now being tested to see how they can do in terms of writing and editing code.
And here this test looks at the ability of these models to execute over 200 tasks in multiple coding languages. Some of them are still only getting a little more than half of these exercises correct, but others, like o4-mini and Gemini 2.5 Pro (the latter is Google's product), are doing much better, in the 70% to 80% range.
And remember, there's something important about models like this. If you're dealing with a system like air traffic control, self-driving cars, and interpreting people's MRIs, mistakes are catastrophic. And so a model that scores less than perfect is a problem.
But for most tasks that a lot of these things are being used for and might be used for, there's the ability to apply both human intervention and other models to come in and clean up mistakes. So when the consequences of mistakes are not catastrophic, I think model scores of less than 100% can be perfectly viable and can still add to productivity.
(DESCRIPTION)
Slide: AI model coding ability. Dot plot titled Coding competitions. Codeforces Elo rating. Y-axis: 0 to 4,000 in increments of 500. X-axis: Model release date. 2022 to 2025 in increments of one year. Five GPT thresholds: 10th, 50th, 90th, and 99th percentile competitor, and top human competitor. Source: Noam Brown (OpenAI). May 2025.
(SPEECH)
Here's another score. Same thing. AI model coding competitions within OpenAI's own universe. When the reasoning models kicked in last year, the score started to go up substantially. So we get into all the details on this. That's enough of that technical stuff.
(DESCRIPTION)
Slide: AI task complexity. Dot plot titled AI task complexity. Time humans take to complete tasks that AI models can complete at 50% accuracy, minutes (log scale). Y-axis: 0.015625 to 1024 in an array of increments. X-axis: Model release date. 2019 to 2025 in increments of one year. Five GPT thresholds: Answer question; Search Wikipedia; Write email; Analyze data; and Add website feature. Source: Kwa et al. (METR). Nature, March 2025.
(SPEECH)
Actually, one more, because I think this is important too: how complex a task can these models try to tackle?
When we first started talking about these models a couple of years ago, we were using them as benchmarks against just looking for an answer to a question in Wikipedia, whereas now we're asking them to write emails, create websites, analyze data sets. And so a recent paper in Nature magazine looked at how long these models can stay on track while working through some very complex, multi-step problems. And these things have improved a lot.
So if you want more information on that, it's in the written piece. Now, the models still struggle with certain real-world issues.
(DESCRIPTION)
Slide: Where do models still struggle? Bar graph titled Ability to solve real world GitHub issues. Y-axis: Percent resolved: 0% to 55% in increments of 5%. X-axis: Eight color-coded bars comparing four platforms: Anthropic, OpenAI, Google, and x AI. Source: Epoch AI, SWE-Bench, JP MAM. May 5, 2025.
(SPEECH)
There are certain, I think, more valuable tests that look at things like, can they fix bugs or add features in GitHub repositories? So far, only Anthropic's Claude product has more than a 50% score on this. And there's other examples as well. There's something called Humanity's Last Exam, where most of the models still do quite poorly.
And then I also asked GPT-4 to draw a map of Europe, and it did a hysterically and hilariously bad job, even after I asked it to try and fix it. It labeled the city of London as "Bland." Now, no argument there, but that's a mistake. And that's in the appendix of the written piece. I think the most important thing is that, other than the coding exercises, none of these benchmarks really have much impact on a chief technology officer that's thinking about enterprise adoption of AI, or on things that drive business impact through enterprise use cases.
So that's what we're going to look at next. Enough with all of these theoretical exams and things. What's going on in the real world?
(DESCRIPTION)
Slide: Hallucinations. While some of OpenAI's newer reasoning models score well on specific stylized exercises they have been explicitly trained on, these models also exhibit very high hallucination rates in broader exercises that they have not been trained on. This is Goodhart's law in spades. Table titled OpenAI hallucination evaluations. Five columns: Dataset, Metric, o3, o4-mini, and o1. Two rows: SimpleQA and PersonQA. Source: "OpenAI o3 and o4-mini System Card," OpenAI. April 16, 2025.
(SPEECH)
Now, to be clear, a lot of these tests and exams and tasks are things that a lot of these models score well on, but only after all of the AI model builders torture their models to do well on them. And so we also have to take a look at the hallucination rates that some of the new reasoning models are experiencing in the wild, when they're not working on certain preset exercises, and they are very high.
This hallucination issue right now is a big problem for reasoning models. One possible explanation is that some of them are recursively sampling base models that have single-digit hallucination rates; if you keep sampling multiple times, you're going to end up with a very high hallucination rate. Some people think this is more of an engineering problem than a science problem. There may be paths around it. But the bottom line is, look at this table: the hallucination rates of OpenAI's suite of reasoning models are roughly in the neighborhood of 50%.
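A minimal sketch of that compounding explanation, assuming independent errors across samples (a simplification): if each call to the base model hallucinates with single-digit probability p, the chance that at least one of n chained samples contains a hallucination is 1 - (1 - p)^n, which climbs toward 50% quickly.

```python
# How single-digit per-call hallucination rates can compound across a
# multi-step reasoning chain (assumes independent errors, a simplification).
def chain_hallucination_rate(p_per_call: float, n_calls: int) -> float:
    """Probability that at least one of n sampled steps hallucinates."""
    return 1 - (1 - p_per_call) ** n_calls

for n in (1, 5, 10, 15):
    print(n, round(chain_hallucination_rate(0.05, n), 2))
# 1 0.05 / 5 0.23 / 10 0.40 / 15 0.54: a ~5% base rate approaches ~50%
```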
And sometimes it provides broken links. Sometimes it even describes steps in interim computations that it didn't even do. I mean, it tells the kinds of lies and falsehoods that my four-year-old used to tell, and that's how easily some of them are identified. So I think it's important to understand that some of the improved proficiency we're seeing on the prior charts and pages reflects things they've been trained to do better at. Whereas, used on a random, broad basis by the rest of us, the diaspora of users, we're going to be much more subject to hallucination risks.
This is something called Goodhart's law in spades. Goodhart's law means once a benchmark becomes widely accepted, it tends to lose its value, because people game and manipulate the outcome. The best example is in colonial India: they had a problem with cobras, and so they paid people to bring in cobras so they could kill them. So then people started to breed cobras so they could bring them in and get paid for their cobras, and they ended up with too many cobras.
Anyway, it's very important to understand that the hallucination risks for the reasoning models are still very high, and it's still a problem that has to be solved. It also means that in corporate applications, all of those hallucination risks have to be bred out of whatever process those reasoning models are used for.
(DESCRIPTION)
Slide: Anecdotal signs of AI adoption in real world business use cases. Three bar charts measuring percentage of respondents: Expected 3-year reduction in employees from GenAI; Cost reductions in the past 12 months from GenAI use ; and Revenue increase in past 12 months from GenAI use. Source: McKinsey, JP MAM. March 2025.
(SPEECH)
Well, I'm not a huge fan of McKinsey surveys. I think there's questions about rigor and thoroughness and things like that. My favorite study about consultants came out a few years ago, and shows that when you hire consultants like McKinsey, the most likely outcome is that six or nine months later, you're still hiring consultants like McKinsey. Anyway. That said, they did a survey of about 1,500 companies and asked them questions like: how much, over the next three years, do you think you're going to reduce employees? How much are you going to reduce overall costs? How much is your revenue going to go up?
And to simplify this survey, the good news is that around 50% of all respondents said they expect employee cost reductions and revenue increases from adoption of generative AI. That's the good news. The bad news is that the most frequent answer in almost each case, among the people who said it was going to help, was the smallest amount of help. In other words, the most frequent three-year reduction expected was 3% to 10% of employees, rather than 11% to 20% or more than 20%, et cetera. Same story with revenue: the most frequent response was it'll help us by less than 5%.
But that said, it does show that AI adoption is increasing in real world business cases.
(DESCRIPTION)
Slide: Anecdotal signs of AI adoption in real world business use cases. A bar graph titled Generative AI adoption, October 2023 versus December 2024. Adoption rate. X-axis: 0% to 70% in increments of 10%. Fifteen categories ranging from Software to HR. Color-coded comparison. Source: Bain Generative AI Survey, JP MAM. December 2024.
(SPEECH)
And we're getting the same story from a Bain survey that was just completed, looking at the change over the last year: AI adoption rates have gone up by 50% to 60%.
(DESCRIPTION)
Slide: Anecdotal signs of AI adoption in real world business use cases. Bar chart titled Census: AI adoption rates by sector and date. Share of firms using AI. Y-axis: 0% to 35% in increments of 5%. The bars compare rates in the Next 6 months, May 2025, and August 2024 in fifteen categories. Source: US Census Bureau, JP MAM. May 2025.
(SPEECH)
And the census also does an interesting survey where they look at adoption rates by sector over time, and you're starting to see adoption rates of 20% to 30% in some of the sectors where you'd expect to see them.
I thought an interesting anecdote was, and this is from one of our AI researchers, there's a Mexican used car platform that told our researchers they replaced their entire outbound sales team with an AI voice model powered by a generative AI platform. And most of the time, customers can't even tell they're speaking to a voice bot. And the voice bot does a better job converting customers than the human baseline they're comparing it to.
So
(DESCRIPTION)
Slide: Sharp increase in the number of FDA-approved medical devices that rely on AI & ML. Bar chart titled FDA-authorized AI & machine learning medical devices. Y-axis: Number of devices approved per year: 0 to 250 in increments of 50. X-axis: 1995 to 2022 in increments of three years. Sharp increase begins in 2016. Source: FDA, JP MAM. 2024.
(SPEECH)
another couple of things. There's been a sharp increase in FDA-approved medical devices that rely on AI and machine learning, and I want to finish up with something about the FDA at the end.
(DESCRIPTION)
Slide: Is AI impacting the job market? Line graph titled Recent graduate employment gap. All workers unemployment rate - recent graduate unemployment rate. Y-axis: negative 2% to positive 3% in increments of 1%. X-axis: 1990 to 2025 in increments of five years. Source: Census Bureau, BLS, JP MAM. March 2025.
(SPEECH)
And then there was an article in The Atlantic that I thought was interesting that proposes the idea that AI is starting to impact the job market. And they show this chart as an example of that, which is for the better part of the last 30 years, the overall unemployment rate was higher than the unemployment rate for recent college graduates.
Whereas since 2020, which is before some of this AI stuff really kicked in, that number has been falling, so recent graduates now have higher unemployment rates than the overall market. And what do recent college graduates do? They summarize information, they aggregate data, and they create charts and tables and graphs. And if that's what AI is getting better at, and if that's what AI is being used for, that would help explain this. So this is indirect evidence at best.
(DESCRIPTION)
Slide: Hyperscaler capital spending.
(SPEECH)
Now, last September, when I last did an AI update, I expressed a lot of concern that the hyperscalers were spending a ton of money, and that we would pretty soon need to see hard evidence of them starting to get a return on all that investment. And at the time, I cited an analysis by a guy named David Cahn at Sequoia, where he backed into, using his own assumptions, how much the industry would need to earn every year, assuming certain capital spending and gross margins for the hyperscalers. And he got a number like $500 billion a year in annual incremental AI revenue.
Now, he assumed the requirement for a very rapid payback period. But even if you relax that constraint, you need some very big-figure AI revenues for these companies, given that they're spending hundreds of billions of dollars on CapEx and R&D every year. So what are we seeing?
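As a rough illustration of the kind of back-of-the-envelope math involved (a stylized reconstruction, not Cahn's actual model, with placeholder inputs): if capex has to be recovered over a short payback window at a given gross margin, the required annual AI revenue is capex divided by margin times payback years.

```python
# Stylized reconstruction of a Cahn-style revenue requirement;
# NOT his actual model, and the inputs below are illustrative placeholders.
annual_capex = 300e9    # hypothetical AI-related capex per year
gross_margin = 0.50     # hypothetical cloud/software gross margin
payback_years = 1.0     # a demanding one-year payback assumption

# Gross profit must cover the capex within the payback window:
# revenue * margin * payback_years >= capex
required_revenue = annual_capex / (gross_margin * payback_years)
print(f"required annual AI revenue: ${required_revenue / 1e9:.0f}B")

# Relaxing the payback constraint to, say, three years cuts the requirement:
print(f"with a 3-year payback: ${annual_capex / (gross_margin * 3) / 1e9:.0f}B")
```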
(DESCRIPTION)
Slide: Hyperscaler capital spending. Line graph titled Hyperscaler capex and R&D as a share of revenues - Y-axis: 15% to 70% in increments of 5%. X-axis: 2017 to 2025 in increments of one year. Color-coded lines representing four color-coded platforms: Meta, Microsoft, Alphabet, and Amazon. Arrow points to Meta's sharp ascent as Metaverse money incineration. Source: Bloomberg, JP MAM. Q1 2025.
(SPEECH)
Well, first of all, the hyperscaler capital spending and R&D as a share of revenue figures are starting to creep up again. So in 2022 to 2024, for Microsoft, Google and Amazon, those numbers were plateauing at around 25%. In other words, they were spending more, but they were also earning more.
Now those numbers are starting to go up, and we're at 30% to 35%. So we have to start to watch this, because it seems like capital spending growth is overtaking the overall revenue growth of these businesses. Now,
(DESCRIPTION)
Slide: Hyperscaler AI revenues. Line graph titled Microsoft Azure quarterly revenue from AI. Y-axis: $0.0 billion to $1.35 billion in increments of $.5 billion. X-axis: December 2022 to December 2024 in increments of six months. March 31, 2025 trailing 1-year growth rate: 154%. Source: Bloomberg JP MAM. Q1 2025.
(SPEECH)
the good news is Microsoft gave us some clues. As far as we can tell, they're the only one of the hyperscalers giving you hard data on how much they're earning from AI. On a trailing one-year basis, it looks like $3 to $3.5 billion, obviously growing substantially in terms of year-on-year rates.
And then another interesting observation: they said they processed 100 trillion tokens in Q1 2025, 50 trillion of them just in March. Obviously, that's a super jargony thing for them to say, but when you translate it into English, based on what we know about the way these models work, it means that there's a lot of inference activity going on, and not just model training. And inference volume is typically a sign that corporations are adopting AI models and using them in actual workflows. So that was a good sign too.
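The acceleration implied by that token disclosure takes one line of arithmetic, using only the two figures quoted: if March alone accounted for half of the quarter's tokens, March's run rate was roughly double the January-February average.

```python
# Token figures quoted by Microsoft for Q1 2025 (from the text).
q1_tokens = 100e12     # 100 trillion tokens processed in Q1 2025
march_tokens = 50e12   # 50 trillion of them in March alone

jan_feb_monthly_avg = (q1_tokens - march_tokens) / 2   # ~25T per month
print(f"Jan-Feb monthly average: {jan_feb_monthly_avg / 1e12:.0f}T tokens")
print(f"March vs prior run rate: {march_tokens / jan_feb_monthly_avg:.1f}x")  # 2.0x
```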
(DESCRIPTION)
Slide: Hyperscaler cloud revenues. Line graph titled Hyperscaler quarterly revenue from cloud services. Y-axis: $0 billion to $30 billion in increments of $5 billion. X-axis: 2021 to 2025 in increments of one year. Five color-coded lines representing five platforms: Amazon Web Services, Microsoft Azure, Oracle Cloud, Google Cloud, and IBM Cloud. Source: Bloomberg, JP MAM. Q1 2025.
(SPEECH)
And for the other hyperscalers, their AI revenues are buried inside the cloud and we're starting to see quarterly revenues, particularly for Amazon, Microsoft, and Google start to pick up, although they're flat for Oracle and IBM. So we're seeing some evidence that AI revenues are picking up when we look at the cloud.
(DESCRIPTION)
Slide: Hyperscaler free cash flow margins. Line graph. Y-axis: negative 20% to positive 50% in increments of 10%. X-axis: 2017 to 2025 in increments of one year. Color-coded lines representing four platforms: Meta, Microsoft, Alphabet, and Amazon. Source: Bloomberg JP MAM. Q1 2025.
(SPEECH)
But at the end of the day, the most important chart is this highly cyclical one: how long can these hyperscalers keep this going? And that's going to be a function of their overall free cash flow margins. Amazon's is still negative, but Meta, Microsoft and Google are still hanging in with 20% to 30% free cash flow margins.
As long as that's the case, I think they'll be able to keep these capital spending wars going. But that's the thing we have to watch the most. If you start to see a sustained dip for Meta, Microsoft, and Google below the 20% level in terms of free cash flow margins, I think the markets would be very concerned that the capital spending is getting way ahead of the AI revenue generation. So what happens next?
(DESCRIPTION)
Slide: What happens next.
(SPEECH)
Our people (you know, we have a lot of AI going on inside the company) tell me there's too much focus on software developers. They only spend around 30% of their time actually coding.
(DESCRIPTION)
Text: Higher for junior developers, lower for senior developers.
(SPEECH)
So even if you improve their productivity in coding by 50%, you're still only talking about a 15% overall productivity improvement. And they think the larger gains from generative AI, rather than coding per se, are in things like software maintenance, unit testing, integration testing and performance monitoring. These are harder things to measure, but they think the savings potential is much greater behind the scenes.
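The 15% is just the time share multiplied by the improvement. A slightly more careful Amdahl's-law-style reading, which treats the 50% as a speedup of the coding portion rather than time simply removed, gives a smaller number; a quick sketch of both:

```python
coding_share = 0.30   # fraction of developer time spent coding (from the text)
improvement = 0.50    # assumed coding productivity gain (from the text)

# Simple reading used here: 30% of time, 50% better -> 15% overall.
simple_gain = coding_share * improvement
print(f"simple estimate: {simple_gain:.0%}")   # 15%

# Amdahl-style reading: coding runs 1.5x faster, everything else unchanged.
new_total_time = (1 - coding_share) + coding_share / (1 + improvement)
print(f"Amdahl-style time saved: {1 - new_total_time:.0%}")  # 10%
```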
I think Microsoft and Amazon are likely to accelerate efforts to build their own foundational models. There's a lot of stuff going on that you can read about each week in the press on the ongoing divorce between OpenAI and Microsoft. And Amazon, Google and Microsoft are trying to manufacture their own GPU-like chips to break NVIDIA's stranglehold on the market,
(DESCRIPTION)
Text: Trainium/Inferentia, Tensor, and Maia, respectively.
(SPEECH)
and there's a bunch of AI adoption milestones to watch for over the next couple of years in terms of self-driving cars and drones, multimodal AI used in entertainment, personalized AI assistants, and things like that.
(DESCRIPTION)
Slide: Biggest experiment ever. Line graph titled Hyperscaler capex and R&D as a share of revenues. Y-axis: 0% to 45% in increments of 5%. X-axis: 1995 to 2022 in increments of five years. Color-coded lines representing three platforms: Microsoft, Alphabet, and Amazon. Source: Bloomberg, JP MAM. Q1 2025.
(SPEECH)
But at the end of the day, looking back to the 1990s, this is the biggest capital spending experiment on record by the tech sector. We're now setting consistent new highs in terms of capital spending and R&D as a share of revenues. And so the bottom line is: this thing better work. And the AI adoption rates are going to have to keep going up pretty soon.
(DESCRIPTION)
Slide: The MMR vaccine and the US measles outbreak.
(SPEECH)
I did mention the AI and machine learning medical devices that have been approved by the FDA. Speaking of the FDA, I don't know if I should be drinking so many Frescas; I have no idea if that's healthy or not. So I did want to mention one thing about the FDA: drug approval rates have fallen in half, at least in Q1 of this year.
And I was thinking about that recently when I found out people immunized between 1963 and 1967 for measles, mumps, rubella received an inactivated version of the vaccine. And it was just that brief period because the live vaccine wasn't approved, pardon me, until 1967.
So the problem is the inactivated version of the vaccine gives you much lower immunity than the live one, and some of the studies show that after getting the vaccine, only a quarter of the people still had detectable antibodies at some point later. So the CDC is recommending that individuals vaccinated during that period get a new live vaccine. But with certain medical conditions, like one I happen to have, you can't get the live attenuated vaccines, because they contain live attenuated viruses, and they're not good for people that have certain immune deficiencies.
So now people like me and others are being negatively impacted by all the people deciding that they don't want to get vaccinated anymore for MMR, even though it's 97% effective against the spread of measles. And just to give you a sense of infectiousness: the measure for COVID and the flu is 1 to 2, for polio and smallpox it's 5 or 7, and for measles it's 12 to 18.
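Those R0 figures map directly onto the vaccination numbers that follow: the standard herd immunity threshold is 1 - 1/R0, so measles with an R0 of 12 to 18 needs roughly 92% to 94% coverage, which is why a national slide from 95% toward 92%, and state rates below 90%, matter so much. A quick check of that standard formula:

```python
# Standard herd immunity threshold: the immune share must exceed 1 - 1/R0.
def herd_immunity_threshold(r0: float) -> float:
    return 1 - 1 / r0

for name, r0 in [("flu/COVID (as quoted)", 2), ("polio/smallpox", 6),
                 ("measles, low end", 12), ("measles, high end", 18)]:
    print(f"{name}: {herd_immunity_threshold(r0):.0%}")
# measles: ~92% (R0 = 12) to ~94% (R0 = 18), right where US rates now sit
```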
But what's going on in the country is that over the last 10 years, the nationwide vaccination rate has fallen from 95% to 92%. And there's a whole bunch of states below 90%: Georgia, Colorado, Wisconsin, Alaska; and Idaho has fallen to 80%. And in Gaines County, Texas, where a lot of the measles cases have occurred, vaccination rates are 80%, and one school district is below 50%.
And there was a study recently from Stanford that estimated that measles could become endemic again within two decades, given these declines in vaccination. At the same time, instead of consistently messaging the importance of the vaccine, RFK Jr. has directed health agencies to explore potential new treatments for people who get measles, including vitamins and cod liver oil. I think it's a good time for me to stop this podcast right there. Thank you for listening, and we'll see you again next time. Bye.
(DESCRIPTION)
Logo: J.P. Morgan.
About Eye on the Market
Since 2005, Michael has been the author of Eye on the Market, covering a wide range of topics across the markets, investments, economics, politics, energy, municipal finance and more.