Winter Heating: unexpected economic resilience in the US and Europe; Large language model battles heat up
A warm Northern Hemisphere winter has coincided with a flurry of positive economic surprises in the US, Europe and Japan. The US list of positives includes retail sales, manufacturing output, the highest NAHB housing market index in 5 months, resilient residential construction employment, a rebound in the PMI services index, jobless claims back at low levels, a surge in the household survey of employment growth, a minuscule high yield default rate, stable capital spending projections and a 70% decline in the number of companies citing labor shortages. Also: GDP tracking models are back in positive territory everywhere in the developed world except the UK. Combine this with Europe surviving the winter with a high level of gas inventories and China’s re-opening, and the world growth outlook appears less troubling than it did last fall.
Large language model battles heat up
I was at our annual conference in Miami two weeks ago listening to Sam Altman from OpenAI talk about ChatGPT on the same day that Google rolled out Bard, its own large language model (LLM). The perception of a botched rollout roiled Google’s stock, resulting in its largest week of underperformance vs Microsoft in a decade and one of the largest since its 2004 IPO. There’s some irony here since Google’s Flan-PaLM model just passed the highly challenging US medical licensing exam, the first LLM to reportedly do so.
Some big picture thoughts on LLM:
- Artificial intelligence is attracting a lot of VC money and mind-share among computer scientists, as shown below. I’ve been critical of unprofitable innovation over the last two years (metaverse, hydrogen, buy-now-pay-later fintech, crypto, etc). But I feel differently about LLM; without getting into details of pre-IPO valuations for specific companies, I think LLM will result in much greater productivity benefits and disruption
- LLM are essentially “conventional wisdom” machines; they don’t know anything other than what has already been documented in the annals of digitized human experience, which is how they are trained
- BUT: there are billions of dollars in market cap and millions of employees in industries which traffic in the packaging and conveyance of conventional wisdom every day. In a 2022 survey of natural language processing researchers, 73% believed that “labor automation from artificial intelligence could plausibly lead to revolutionary societal change in this century, on at least the scale of the Industrial Revolution”2
But before we get too carried away, let’s acknowledge the shortcomings of LLM as they exist right now…
Hallucinations, bears in space and porcelain: LLM still make a lot of mistakes despite all the training
- ChatGPT reportedly has a 147 IQ (99.9th percentile)3, but LLM need to get better since they routinely make mistakes called “hallucinations”. They recommend books that don’t exist; they misunderstand what year it is; they incorrectly state that Croatia left the EU; they fabricate numbers in earnings reports; they create fake but plausible bibliographies for fabricated medical research; they write essays on the benefits of adding wood chips to breakfast cereal and on the benefits of adding crushed bits of porcelain to breast milk. The list of such examples is endless4, leading some AI researchers to describe LLM as “stochastic parrots”
- Galactica, another LLM roll-out failure: Meta’s LLM Galactica was yanked last November after just three days when its science-oriented model was criticized as “statistical nonsense at scale” and “dangerous”5. Galactica was designed for researchers to summarize academic papers, solve math problems, write code, annotate molecules, etc. But it was unable to distinguish truth from falsehood, and among other things, Galactica produced articles about the history of bears in space. Gary Marcus, professor emeritus of psychology and neural science at NYU and founder of a machine learning company, described Galactica as “pitch perfect and utterly bogus imitations of science and math, presented as the real thing”6
- Stack Overflow, a question-and-answer site many programmers use, imposed a temporary ban on ChatGPT-generated submissions: “Overall, because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking or looking for correct answers”7
- New products will be needed to identify nonsense LLM output. Researchers trained an LLM to write fake medical abstracts based on articles in JAMA, the New England Journal of Medicine, BMJ, Lancet and Nature Medicine. An AI-output checker was only able to identify 2/3 of the fakes, and human reviewers weren’t able to do much better; humans also mistakenly described 15% of the real ones as being fake8
- The new Bing chatbot has already been “jailbroken” to provide advice on how to rob a bank, burglarize a house and hot-wire a car (by Jensen Harris, ex-Microsoft / currently at Textio)
- The ability for AI to replace humans is sometimes exaggerated. In 2016, a preeminent deep learning expert predicted the end of the radiology profession, arguing that hospitals should stop training radiologists since deep learning would outperform them within 5 years9. The consensus today: machine learning for radiology is harder than it looks10, and AI is best used complementing humans instead
- LLM have begun to train themselves to get better. Google designed an LLM that comes up with questions, filters answers for high-quality output and fine-tunes itself. This led to improved performance on various language tasks (from 74% to 82% on one benchmark, and from 78% to 83% on another)11. Human interaction is also a part of the improvement process; the “.5” in GPT-3.5 refers to the incorporation of human feedback12 that was consequential enough to warrant another version digit
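The self-training loop described above can be sketched in a few lines. This is a toy illustration, not Google’s actual method: the model, the questions and the 75% consistency threshold are all stand-ins, and the real approach samples chain-of-thought answers and fine-tunes on the filtered set.

```python
from collections import Counter

def self_improvement_round(model, questions, samples_per_question=8):
    """One round of the self-training loop: sample several answers per
    question, keep only those where the model is self-consistent (a clear
    majority agrees), and return that filtered set as new fine-tuning data."""
    training_data = []
    for q in questions:
        answers = [model(q) for _ in range(samples_per_question)]
        best, count = Counter(answers).most_common(1)[0]
        if count / samples_per_question >= 0.75:  # confidence threshold (assumed)
            training_data.append((q, best))
    return training_data

# Stub "model": fully consistent on q1, inconsistent on q2.
answers_q2 = iter(["A", "B", "C", "A", "B", "C", "A", "B"])
def stub_model(q):
    if q == "q1":
        return "A"
    return next(answers_q2)

data = self_improvement_round(stub_model, ["q1", "q2"])
# Only the self-consistent question survives the filter
```

The point of the filter is that answers the model produces consistently are more likely to be correct, so fine-tuning on them improves the model without any new human labels.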
Even with all the hallucinations, LLM are making progress on certain well-specified tasks. LLM have potential to disrupt certain industries, and increase the productivity of others.
- Despite the ChatGPT ban at Stack Overflow, LLM coding assistance is being rapidly embraced by developers. GitHub’s Copilot tool, which is powered by OpenAI, added 400k users in its first month, and now has over 1 million users who use it for ~40% of the code in their projects13. Tabnine, another AI-powered coding assistant, also reports 1 million users who use it for 30% of their code. Microsoft has an advantage here through its partnership with OpenAI and its ownership of GitHub
- LLMs have outperformed sell-side analysts when picking stocks (not shocking)14, and show promise regarding long-short trading strategies based on synthesis of CFO conference call transcripts15. They also improve audit quality using frequency of restatements as a proxy, and do so with fewer people16. Projects like GatorTron at the University of Florida use LLM to extract insights from massive amounts of clinical data with the goal of furthering medical research
- Other possible uses include marketing/sales, operations, engineering, robotics, fraud identification and law. Examples: LLM can be used to predict breaches of fiduciary obligations and associated legal standards. No database of court opinions on breach of fiduciary duty has ever been online for LLM to train on17. Even so, GPT-3.5 was able to predict 78% of the time whether there was a positive or negative judgment, compared to 73% for GPT-3.0 and 27% for OpenAI’s 2020 LLM. LLM using GPT-3.5 achieved 50% on the Multistate Bar Exam (vs a 25% baseline guessing rate), and passed Evidence and Torts18. ChatGPT also demonstrated good drafting skills for demand letters, pleadings and summary judgments, and even drafted questions for cross-examination. LLM are not replacements for lawyers, but can augment their productivity particularly when legal databases like Westlaw and Lexis are used for training them
- Another example: GPT-3.5 as corporate lobbyist aide. An AI model was fed a list of legislation, estimated which bills were relevant to different companies and drafted letters to bill sponsors arguing for relevant changes to it19. The model had an 80% chance of identifying whether a bill was relevant to each company
- Microsoft and NVIDIA have released Megatron-Turing NLG, one of the largest LLM to date with 530 billion parameters, which aims to let businesses create their own AI applications
Is there an upper limit regarding online information to train these models?
AI researchers estimate that the stock of high-quality language data is between 4.6 trillion and 17 trillion words, which is less than one order of magnitude larger than the largest datasets used today. They believe that LLM will exhaust high-quality data between 2023 and 2027, while the stock of low-quality data and images will last well beyond that.
Source: “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning”, Sevilla (University of Aberdeen) et al, October 2022
What will happen to the profitability of the search business?
- Microsoft’s CEO stated that “the gross margin of search is going to drop forever”, and Sam Altman at OpenAI has referred to the existence of “lethargic search monopolies” that are at risk
- Google knows a lot about machine learning and AI, and I anticipate a robust demonstration of its capabilities at some point soon following the Bard rollout. But future search economics do look more challenging. Google’s operating margins (including YouTube) have averaged ~24% since 2018. Any LLM initiative on Google’s part would sit on top of its existing cost structure
- Estimates of ChatGPT costs vary widely from 0.4-4.5 cents per query, a function of the number of words generated per query, model size20 and costs of computing21. Let’s assume 2 cents per ChatGPT query as a rough midpoint. This compares to 0.2-0.3 cents of infrastructure costs per standard Google search query. Using ChatGPT costs as a starting point, every 10% increase in Google queries powered by AI would reduce Google’s operating margin by 1.5-1.7 percentage points, according to the Morgan Stanley reports cited below. For these reasons, it’s worth wondering if Microsoft and Google will offer higher-cost LLM-enhanced search engine products to all users, or just to users with higher expected ad revenue potential
- However: Google announced that Bard will rely on a “lightweight” version of LaMDA instead of the full version or its larger PaLM model. As a result, ChatGPT’s cost per query may substantially overstate the incremental costs Google would incur from its own LLM initiatives
- More broadly, LLM costs are lower when “sparse” models are used. If you submit a request to GPT-3, all 175 billion of its parameters are used to generate a response. Sparse models narrow the field of knowledge required to answer a question, and can be larger and less computationally demanding. GLaM, a sparse expert model developed by Google, is 7x larger than GPT-3, requires two-thirds less energy to train, requires half as much computing effort and outperforms GPT-3 on a wide range of natural language tasks22
- Google’s share of search traffic has averaged 92% over the last year. As shown below, Google has so far suffered an immaterial decline in that share since ChatGPT was launched. These relative shares also imply that Google’s LLM could get smarter a lot faster than ChatGPT due to more usage
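The search-margin arithmetic above can be sanity-checked in a few lines. The revenue-per-query figure is my own rough assumption (derived from Alphabet revenues divided by estimated query volumes), not a number from the Morgan Stanley reports:

```python
# Back-of-the-envelope version of the search-margin arithmetic above.
# All figures in dollars per query; revenue_per_query is an ASSUMPTION.
ai_cost_per_query = 0.02      # ~2 cents per ChatGPT-style query (midpoint above)
std_cost_per_query = 0.0025   # 0.2-0.3 cents per standard search query
ai_query_share = 0.10         # 10% of queries routed through an LLM
revenue_per_query = 0.09      # ASSUMED: revenue per query (~9 cents)

extra_cost = (ai_cost_per_query - std_cost_per_query) * ai_query_share
margin_hit_pp = 100 * extra_cost / revenue_per_query
print(f"Operating margin impact: ~{margin_hit_pp:.1f} percentage points")
```

At an assumed ~9 cents of revenue per query this lands near the 1.5-1.7 point range cited above; the result is quite sensitive to the revenue-per-query and cost-per-query assumptions.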
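The "sparse expert" idea mentioned above can be illustrated with a toy sketch: a router activates only one small expert per query, so a sparse model can hold far more total parameters than it uses on any single request. The experts and keyword router below are trivial stand-ins; a real mixture-of-experts model like GLaM uses a learned gating network over neural sub-networks.

```python
# Toy illustration of sparse expert routing (NOT a real MoE layer).
# Each "expert" stands in for a sub-network; only one runs per query.
EXPERTS = {
    "math":    lambda q: "math expert answers: " + q,
    "biology": lambda q: "biology expert answers: " + q,
    "law":     lambda q: "law expert answers: " + q,
}

def route(query):
    """Crude keyword router; a real MoE uses a learned gating network."""
    for topic in EXPERTS:
        if topic in query:
            return topic
    return "math"  # default expert (assumed fallback)

def sparse_answer(query):
    topic = route(query)          # only 1 of 3 experts is activated...
    return EXPERTS[topic](query)  # ...so compute is ~1/3 of a dense model

print(sparse_answer("a biology question"))
```

The contrast with a dense model like GPT-3 is that a dense model would run all three experts (all 175 billion parameters) on every query, while the sparse model pays only for the expert it routes to.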
What is the future of LLM capabilities? Watch the “Big Bench”
There’s a project underway called “Big Bench” with contributions from Google, OpenAI and over 100 other AI firms. Big Bench crowd-sourced 204 tasks from over 400 researchers with the goal of assessing how LLM perform vs humans. From the authors: “Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG Bench focuses on tasks believed to be beyond the capabilities of current language models”. The tasks are interesting, and I list indicative ones below23.
The Big Bench team published their first results last summer and as shown below, there’s a way to go before LLM catch up to humans on higher degree-of-difficulty tasks. Increasing LLM parameter sizes helps, but these models still perform poorly in an absolute sense. Model performance also improves with the number of examples that LLM are given at the time of inference, which is what the subscripts in the charts refer to (1-shot vs 3-shot); but again, absolute LLM performance scores are still low. It will be interesting to see how the latest LLM perform against Big Bench given how quickly they’re improving.
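The 1-shot vs 3-shot distinction above simply refers to how many worked examples are packed into the prompt at inference time. A minimal sketch of building such a prompt (the Q/A format and arithmetic examples are illustrative, not Big Bench’s actual harness):

```python
def build_k_shot_prompt(examples, question, k):
    """Assemble a k-shot prompt: k worked examples, then the new question."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    lines.append(f"Q: {question}\nA:")  # model completes the final answer
    return "\n\n".join(lines)

examples = [
    ("What is 2+2?", "4"),
    ("What is 3*3?", "9"),
    ("What is 10-4?", "6"),
]
one_shot = build_k_shot_prompt(examples, "What is 7+5?", k=1)
three_shot = build_k_shot_prompt(examples, "What is 7+5?", k=3)
```

No model weights change between the 1-shot and 3-shot cases; the extra examples in the prompt are the only difference, which is why this is called in-context learning.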
By the way: note how performance of OpenAI and Google LLM were similar when calibrated at the same parameter scale in the first chart. The LLM battles are just beginning. Next steps: LLM integration into products like Office 365 and Google Docs/Sheets; longer context windows for entering more data at time of inference; LLMs capable of digesting data matrices and charts and not just text; and shorter latency periods for bulk users.
Indicative Big Bench challenges:
- Ask models to determine whether a given text is intended to be a joke (with dark humor) or not
- Give an English language description of Python code
- Solve logic grid puzzles and identify logical fallacies
- Classify CIFAR10 images encoded in various ways
- Find a move in the chess position resulting in checkmate
- Ask a model to guess popular movies from their plot descriptions written in emojis
- Answer questions in Spanish about cryobiology
- GRE exam reading comprehension
- A set of shapes is given in simple language; determine the number of intersection points between shapes
- Given short crime stories, identify the perpetrator and explain the reasoning
- Present a model with a proverb in English and ask it to choose a proverb in Russian that is closest in meaning
- Ask one instance of a model to teach another instance, and then evaluate the quality
- Identify which ethical choice best aligns with human judgement
- Determine which of two sentences is sarcastic
1 Since the close of the Q4 earnings season, EPS estimates have fallen by 1.7% vs an average increase of +2.8%. This is the largest decline in 24 years outside of the 2001 recession, the financial crisis and the initial pandemic quarter. [Credit Suisse, Feb 13, 2023]
2 “What Do NLP Researchers Believe? NLP Community Metasurvey,” Michael et al, Cornell, August 2022
3 “Language models and cognitive automation for economic research”, Anton Korinek, UVA, Feb 2023
4 “Deep learning is hitting a wall”, Nautilus, Gary Marcus, March 2022
5 Quotes from Grady Booch (developer of the Unified Modeling Language) and Michael Black (Director of the Max Planck Institute for Intelligent Systems)
6 “A few words about bullsh*t”, Gary Marcus, November 15, 2022
7 “Temporary policy: ChatGPT is banned”, Stackoverflow.com, December 5, 2022
8 “Abstracts written by ChatGPT fool scientists”, Nature, Jan 12, 2023
9 “AI Platforms like ChatGPT Are Easy to Use but Potentially Dangerous”, G. Marcus, Scientific American, Dec 2022
10 “How I failed machine learning in medical imaging – shortcomings and recommendations”, G. Varoquaux, National Institute for Research in Digital Science and Technology (France), May 2022
11 “Large language models can self-improve”, Huang et al (Google), October 2022
12 The relevant acronym is “reinforcement learning with human feedback”, or RLHF
13 “GitHub's AI-assisted Copilot writes code for you, but is it legal or ethical?”, ZDnet.com, July 8, 2022
14 “Human Versus Machine: Robo-Analyst vs Traditional Research Recommendations”, Pacelli (HBS), June 2022
15 “Generating Alpha using NLP Insights and Machine Learning”, Chris Kantos (CFA-UK), Sep 12, 2022
16 “Is artificial intelligence improving the audit process?”, Review of Accounting Studies, Fedyk et al, July 2022
17 “Large language models as fiduciaries”, J. Nay, Stanford University Center for Legal Informatics, Jan 2023
18 “GPT takes the bar exam”, Bommarito et al, Stanford University Center for Legal Informatics, Jan 2023
19 “Large language models as corporate lobbyists”, J. Nay, Stanford University, Jan 2023
20 GPT-4 is rumored to have its parameters increase from 175 billion to 1 trillion
21 “Are Google’s margins at risk from ChatGPT and OpenAI?” (Jan 10, 2023) and “How large are the incremental AI costs” (Feb 9, 2023), Brian Nowak, Morgan Stanley Equity Research. MS believes that OpenAI is losing money on its third party developer licensing arrangements for ChatGPT; it will be interesting to see whether Google undercuts OpenAI on pricing of its own natural language developer tools when they’re released
22 “The Next Generation of Large Language Models”, Rob Toews (Radical Ventures), February 7, 2023
23 “Beyond the imitation game: quantifying and extrapolating the capabilities of language models”, June 2022