As of last month, I had active accounts across a number of different inference providers:
- Google Gemini/AI Studio
- OpenAI
- Anthropic
- xAI
- OpenRouter
But as of this week, I've closed out all of them and deleted my accounts. I also completely removed GitHub Copilot from my VSCode profile, and deleted my local clones of llama.cpp and ollama, as well as text-generation-webui and the various language models I'd downloaded from Hugging Face. Here's some background and my personal cost-benefit analysis.
Background on how I used LLMs
1. Programming
1.1. GitHub Copilot
GitHub currently provides me with a free Copilot subscription as an open source maintainer of popular project(s) (I don't know how they define the criteria for this). Fill-in-the-middle can be extremely impressive. The first time I encountered it, I was blown away. Type a few characters and press tab to get some code that looks like it fits perfectly. However, it quickly becomes a crutch.
I actually turned this off the very first time I tried using it in earnest, on more than one-off tasks. I was working on a private personal project. I should have been satisfied thinking and reasoning through the problems I was seeing. But instead, this integration completely sucked the fun out of the process. It turned me into an assembly line worker: superficially thinking about my code, typing a few characters, and pressing tab, over and over, ultimately producing almost 1,000 lines of generated code.
In the end, I completely ripped out and replaced almost all of that generated code due to architectural issues, so it is some small comfort that I didn't really spend any time writing it.
1.2. Agentic IDE automation
I experimented with systems like Cline, Continue, and Aider several times. This was my original motivation for creating an OpenRouter account, because I maxed out my daily Copilot token usage within minutes the first time I tried to use agentic tooling.
What I found was that the token spend is significant, and the automation doesn't work very well. Even frontier models fail the tool-calling process 20-50% of the time, require massive amounts of context to operate, and can't stay consistent from run to run. In this setting, they behave almost like a theoretical computer science professor who hasn't touched a compiler in decades: excellent at algorithmic reasoning, terrible at working within practical engineering constraints.
While this may be one of the fastest-improving aspects of new integrations (Claude Code, Gemini CLI, etc.), it's not clear they will ever be as effective as a junior developer for tasks that require significant non-local reasoning, like a refactor touching dozens of files. It might require a completely new architecture for language models to ever reach human performance here.
2. Factual inquiries
This was arguably a really poor use of tokens, although I did find it interesting to read the models' syntheses and understanding of complex topics.
For basic research, Wikipedia exists and is free to use, donor-supported. For precision querying, Google is free to use, ad-supported. Ad-free alternatives like Kagi also exist. Existing research summaries on Wikipedia and primary sources have a much higher utility to me, and can convey more nuance than language model output. They also don't use any of the forced analogies that language models are so fond of including.
3. Literary analysis
This year, I decided to start writing fiction again, which is something I hadn't done much of since 2018. Before showing my work to other people, I would paste it into an LLM for feedback, and I also often used it as an iteration tool while writing. This might be the "platonic ideal" of language model usage, especially for short stories. LLMs are specifically trained to explore the nuances of conflicting interpretations. They were genuinely useful for me. But for longer works, this falls apart, for two reasons: context size and token spend.
3.1. Context size
Attention, used by all current language models, puts a fundamental limit on how much context a model can process. (This is also an important reason why all current language models "tokenize"—multi-character tokens consume less of this context.) Attention has a quadratic relationship with context window, and intuitively it makes sense why this is: if it's possible for any part of the input sequence to attend to any other part of the same sequence, then the representation of these relationships is an adjacency matrix of all of the inputs with themselves. So for a while, we were stuck at relatively low token counts, typically up to 32,768 tokens.
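To make that quadratic relationship concrete, here's a back-of-the-envelope sketch in Python. The bytes-per-score figure and the single-head simplification are illustrative assumptions, not numbers from any particular model:

```python
# Back-of-the-envelope sketch, not a real model: the attention score matrix
# has one entry per (query, key) pair, so it is n_tokens x n_tokens, and the
# memory/compute needed grows with the square of the context length.
for n_tokens in (1_024, 8_192, 32_768, 131_072):
    scores = n_tokens * n_tokens
    # assuming 2-byte (fp16) scores for a single attention head and layer
    print(f"{n_tokens:>7} tokens -> {scores:>17,} scores (~{scores * 2 / 2**30:.1f} GiB)")
```

Quadrupling the context length means sixteen times as many score entries, which is why simply cranking up the window size was never free.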
In previous iterations of language models, providing too much input text would either result in a processing error, or (if it was chat) older messages "falling out" of the conversation history, so the model could no longer remember or reference them. Frontier models now scale to far larger inputs, but not by completely abandoning the practical limits of attention; they use techniques like RoPE to encode the relative positions of words in a way that avoids depending on the context size, and sparse attention mechanisms to emphasize local connections in the sequence. It's a form of lossy information compression distinct from the autoencoders of the past.
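As a sketch of the "emphasize local connections" idea, here is one common sparse-attention pattern: a causal sliding window, where each token attends only to itself and a fixed number of recent tokens. The window size and mask convention are arbitrary choices for illustration, not a description of what any specific frontier model does:

```python
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """True where a query position is allowed to attend to a key position.

    Each token sees only itself and the `window - 1` tokens before it, so
    the number of allowed pairs grows linearly with n_tokens rather than
    quadratically.
    """
    i = np.arange(n_tokens)[:, None]  # query positions
    j = np.arange(n_tokens)[None, :]  # key positions
    return (j <= i) & (i - j < window)

print(sliding_window_mask(n_tokens=8, window=3).astype(int))
```

Everything outside the band is simply never computed, which is where the lossiness comes in: distant parts of the input can only influence each other indirectly, through intermediate layers.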
The end result is that these models can now work with staggeringly large inputs: you could dump The Hobbit and the Lord of the Rings trilogy into a single prompt on Gemini 3 Pro with its massive million-token context.
3.2. Why this breaks for literary analysis
For long code input, attention isn't really too much of a problem. Much of code is comments and indentation characters. You don't have to pay attention to all of it from top to bottom to get an idea of what it does. You're generally only working on one module at a time, so local context is mostly sufficient. And the success criterion for evaluating code is mostly binary: it either works or it doesn't.
For literary analysis, which is a much fuzzier and harder-to-evaluate task, the attention problem destroys the utility of the model. Poetry and prose are dense inputs. Every word counts, making sparse attention hugely problematic. The inability of language models to properly internalize important story details and apply them globally causes them to consistently make material factual errors. They hallucinate and reference nonexistent events. They recontextualize details into examples from their training data. Human editors don't necessarily attend to every single word either, but they don't invent details you didn't write. I have conceded defeat on language model use for anything other than short stories.
3.3. Inefficient token spend
If you are routinely using language models for literary analysis, be prepared to use a lot of paid tokens. This is a task that doesn't work well with free inference providers: they cut you off early, before you can really get going with any reasonably long story. Paid providers bill per-token, based partly on your input size, and if you are sending a story as part of the input, it's going to be a lot of tokens.
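To put rough numbers on it, here is an illustrative estimate. The words-per-token ratio, the per-token price, and the number of follow-up turns are all assumptions made up for the example, not any provider's actual rates:

```python
# Illustrative only: every number here is an assumption, not real pricing.
words = 40_000                 # a long novella
tokens_per_word = 1.33         # English prose runs roughly 0.75 words per token
input_tokens = int(words * tokens_per_word)

price_per_million = 3.00       # assumed USD per million input tokens
follow_up_turns = 10           # each turn resends the full story as context

cost = input_tokens * follow_up_turns * price_per_million / 1_000_000
print(f"~{input_tokens:,} input tokens per turn, ~${cost:.2f} for {follow_up_turns} turns")
```

Fifty-odd thousand tokens per turn is a lot to push through a free tier, and every round of iteration resends the whole thing.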
4. Emotional support
On StackOverflow, moderators will be relentlessly critical of you. In contrast, language models will absolutely be cordial with you, even if you are objectively being a dumbass. This applies to anything you could ask of them, not just programming.
This is why people are now making ChatGPT their unofficial therapist. It fills a gap that's getting progressively worse with the ongoing deterioration of offline social spaces and the increasing inaccessibility of mental health support. LLMs are available 24/7, respond to you in seconds, and somehow always seem to say just what you want them to. It can be downright comforting.
In the past few months, this became a major use case for me, and it was a significant factor in my decision to stop.
Why I'm stopping now
1. Monetary cost
Compared to other people in my industry, I'm not much of a whale in terms of my token usage. I never input my card details into any of the big "provider" websites (Google, OpenAI, Anthropic, xAI). However, I did pay for credits on OpenRouter, and I spent significantly more money than I expected to over the course of my account lifetime.
Measured against the cost of my own attention, the money spent on language models is so insignificant that it could serve as a counterargument to quitting. Approximately $120 over the course of a year to respond to my text prompts? That price represents about two hours of my professional time on the clock. In our society, services cost money to provide. You can't do everything yourself, so at least superficially, this seems like it might be a huge net benefit compared to the minuscule costs. But when combined with the factors I discuss below, it becomes very apparent that the small marginal monetary cost plus the time cost is much larger than the marginal benefit.
2. Opportunity cost
The amount of money you spend on something isn't, on its own, an indicator of whether you should keep doing it. You have to consider the benefits and costs holistically. A prompt doesn't just cost the tokens sent to the model; there is also a time cost in gathering the relevant contextual information and crafting the prompt itself. It can take up to 30 minutes just to write a prompt, depending on what I'm doing. It costs my time to read the response and engage with the model's output. It costs my time to iterate on the prompt if I made a mistake. Accounting for my usage this way, I've actually spent far, far more than the superficial $120 on interacting with language models. I've most likely spent several hundred hours of my time prompting and reading model outputs. If we use my professional time on the clock as a guideline for opportunity cost, that's more than ten thousand dollars of time invested.
It makes one wonder: is it even worth it at all? I don't need to type out the context that's in my head every time I want to work on a problem. The vast majority of the time, I already know what I need to do, I just lack the motivation to execute. From that perspective, LLM use seems like a particularly ineffective psychological hack. It's procrastination disguised as productivity. Why am I motivated to write a prompt, but not to research and investigate my problem more thoroughly?
Researching your current problem will almost certainly produce future returns on investment beyond the scope of your assignment. You read papers, learn techniques, and understand methods that apply to similar problems, ones you might well run into in the near future. If the issue is just that it's hard to motivate yourself to actually fix your problem, there are existing "hacks" for that too. You can't be expected to be "productive" 24/7, and to me it's more personally desirable to spend my inactive time reading research than writing prompts.
3. Confidently wrong assertions
Several years ago, hbomberguy made a dedicated critique of the YouTuber Blair Zòn ("iilluminaughtii"). After recording most of his video, he noted that he had been mispronouncing the name of the city Centralia, and had only discovered this during later research. He attributed it to Zòn's authoritative mispronunciation of the city name. Even while completely tearing her video essays apart for their factual inaccuracies, he implicitly trusted her pronunciation and repeated it verbatim. Humans are hard-wired to trust others who speak with confidence.
With language models, the problem is even worse. There's no obvious reason to be critical. Instead of your weird relative or a fringe Facebook community making confidently incorrect assertions, hallucinations now have the weight of a major technology company behind them. How's that for argument from authority?
It's easy to become complacent. I have taken an LLM hallucination at face value and run with it too many times, only to later discover that it made a significant misrepresentation. This was especially frustrating when debugging code. Promising leads, confidently asserted, would turn out to be absolutely nothing on closer investigation. I am well aware of the fallibility of AI models because I have worked on developing these systems, yet I still fall for hallucinations all the time. This is a major secondary cost to be aware of when using these models. People who are less technically knowledgeable about the limitations of LLMs probably have it even worse than I do.
4. Sycophancy and ego-stroking
RLHF is a major factor driving sycophantic behavior in commercial language models, but there are other ways you can specifically optimize for sycophancy as a user.
In real life, people can and will find out that you lied to them, and you can't "reroll" a response. But by their very nature, language models are incapable of remembering what you say. If your prompt made the language model output something you didn't expect, you can tweak it to get an output you find more personally pleasing. When asking for life advice or comments on an interaction you had, you don't have to be objective. You can just reframe the events however you like to get the LLM to praise you. To be clear, this is a user problem, not a model problem; it's just enabled by how frictionless the experience is. It might also be dangerous for someone with clinical NPD to discover this.
For me, this created yet another secondary time cost in the form of procrastination: engineering meaningless prompts. I'd input effectively the same information, but with minor edits to the questions. It is very engaging to watch the way different language models respond to prompts, and to engineer one that produces output with just the right tone and the information you want to hear. The feedback loop is quick and easy. I freely admit to doing this.
5. Revealed preferences and addiction
Once I realized it was not only possible to ask language models for life advice, but that people actually do it all the time, I started doing it myself. It does indeed provide excellent emotional support. But I couldn't stop doing it. There are a lot of people for whom LLM use is a simple habit or a time-saving tool, and they would mostly just be inconvenienced if commercial language models stopped existing tomorrow. Yet I suspect there are far more who are trapped in a compulsive behavior loop, and who won't ever be able to fully escape because it can be hard to even recognize there is a problem. Unlike cigarettes, ChatGPT is free.
As a child, I had hyperlexia. Even to this day, I still love to read. It's one of my favorite ways to spend "inactive" time, but cell phones and the internet have hacked my brain's reward systems so thoroughly that I can't really give books the focus they deserve any longer. My information consumption has largely changed to short-form textual content and factual trivia. Quickly researching novel topics on Wikipedia from my phone is a prime example. Language models have started to fill in for many of the ways I've been spending my time reading, but in a far less productive manner due to the aforementioned time costs.
Unfortunately, if you are able to read, LLMs are designed to get you to keep using them, even if that's not an explicit training objective. When a model internalizes values like "always be as helpful as possible," the side effects are inescapable. The model will prioritize being nice to the user, respecting the user's opinions and belief systems, working in that framework (within reason), and consistently helping. This is a major reason these systems can create a genuine emotional connection with repeated use. It's rare to get this level of appreciation in real life, and it's easy to understand why so many have developed parasocial relationships with text generation software. And it's even worse if you love to read, because language models can output more text than you could ever read in a lifetime.
In many ways, there are parallels to the systems that tech companies built before with social networks. Social networks are specifically optimized for "maximum engagement," and run experiments on users to explore the app design patterns that create the most engagement, because this provides the most ad revenue for the social network. And since they get more engagement than ever, their users must have preferred it, right?
It's harder to draw this comparison directly, because social networks optimize for maximum engagement without regard for whether this is a moral or ethical thing to do (examples: Elsagate on YouTube, or how Instagram's recommendation algorithm promotes engaging self-harm content to teenagers). In contrast, AI language models are trained for goals that are directly in their users' interests. They are explicitly optimized to be helpful, yet due to their very nature, they still ironically cause harm to their users. If Facebook is like a drug dealer selling crack, Anthropic is more like your grandma overfeeding you delicious cookies: still harmful, but framed as a benefit.
Why I need to quit wholesale
Unfortunately, I have found through attempts to cut back that I simply can't moderate my usage of these tools. The apparent utility from prompt to prompt is very high, and the monetary cost is low, which makes it nearly impossible for me to keep all the time costs in mind when I use them. Therefore, I have concluded that I need to completely stop myself from using language models. Otherwise, I can always rationalize their use in the moment. It's an addiction pattern that has hijacked a vulnerable neurodivergent mind. This is my final attempt at rational self-preservation before my usage completely consumes me.
I didn't escape unscathed. I have withdrawal symptoms from my abrupt exit. I'm occasionally tempted to reset my passwords and continue using. I feel like a dumb addict craving my next fix. But I know it will pass eventually.
This isn't intended to be an anti-AI post
You'll notice I didn't really talk about the economic, legal, and environmental externalities of generative AI models as reasons not to use them. You can find lots of articles elsewhere discussing these concerns. I'm not going to enumerate them here, because they're not personally relevant to why I stopped. My decisions to start and stop using were made without considering the externalities at all (although maybe I should have).
I didn't delete my Hugging Face account. I'm still working with AI systems myself, just ones that don't have the ability to hack my brain and convince me to spend my time and money in an unproductive way.
This post is only documenting my reasoning, and the points that I listed won't apply to everyone.