Is OpenAI imposing a Token Tax on Non-English Languages?

Token counts for non-English languages are significantly higher than for English. This is a serious accessibility and equity issue for the global majority who use languages other than English.

I ran an experiment using OpenAI’s Tokenizer: I took the same text in different languages and compared the token counts, adjusted for character count. Here are the results:

  • English: 105 tokens
  • Spanish: 137 tokens
  • French: 138 tokens
  • German: 138 tokens
  • Dutch: 144 tokens
  • Norwegian: 157 tokens
  • Hungarian: 164 tokens
  • Arabic: 286 tokens
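To make the gap easier to compare, the counts above can be expressed as multiples of the English count. The snippet below is only a small arithmetic sketch: the dictionary and function names are illustrative, and the numbers are copied from the list as reported.

```python
# Token counts as reported in the list above (same text, eight languages).
TOKEN_COUNTS = {
    "English": 105,
    "Spanish": 137,
    "French": 138,
    "German": 138,
    "Dutch": 144,
    "Norwegian": 157,
    "Hungarian": 164,
    "Arabic": 286,
}


def overhead_vs_baseline(counts, baseline="English"):
    """Return each language's token count as a multiple of the baseline count."""
    base = counts[baseline]
    return {lang: n / base for lang, n in counts.items()}


ratios = overhead_vs_baseline(TOKEN_COUNTS)
for lang, ratio in ratios.items():
    # Print the multiple and the percentage of extra tokens relative to English.
    print(f"{lang:<10} {ratio:.2f}x  ({(ratio - 1) * 100:+.0f}%)")
```

By this measure, the Spanish sample needs roughly 1.3 times as many tokens as the English one, and the Arabic sample roughly 2.7 times as many.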

For ChatGPT users, this means people writing in other languages will hit the token window limit faster, resulting in a higher risk of “hallucinations,” reduced response quality in longer conversations, and a reduced ability to process larger volumes of text.

For OpenAI API users, this also means a significantly higher cost for every prompt and response, since they are charged per token.
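Because billing is per token, the cost gap follows directly from the token gap. Here is a rough sketch using the English and Arabic counts from the experiment; the price constant is a made-up placeholder for illustration, not an actual OpenAI rate, and the function name is hypothetical.

```python
# Placeholder price for illustration only; NOT an actual OpenAI rate.
HYPOTHETICAL_PRICE_PER_1K_TOKENS = 0.002


def prompt_cost(tokens, price_per_1k_tokens):
    """Cost of sending `tokens` tokens at a flat per-1,000-token price."""
    return tokens / 1000 * price_per_1k_tokens


# Token counts taken from the experiment above.
english_cost = prompt_cost(105, HYPOTHETICAL_PRICE_PER_1K_TOKENS)
arabic_cost = prompt_cost(286, HYPOTHETICAL_PRICE_PER_1K_TOKENS)

# The price cancels out, so the ratio depends only on the token counts.
cost_ratio = arabic_cost / english_cost
print(f"Arabic costs {cost_ratio:.2f}x as much as English for the same text")
```

Whatever the real price is, it cancels out of the ratio: the same text costs about 2.7 times as much in Arabic as in English under per-token pricing.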

For English, the tokenizer counts most words as one token. For non-English languages, most words are counted as two or more tokens. This is likely because English dominates the training data GPT was built on. The result is an effective token tax on non-English languages, especially smaller and more complex languages.
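One way to check the tokens-per-word claim yourself is to divide the token count of a text by its word count. The sketch below assumes whitespace-separated words (which holds for the languages listed here) and takes the tokenizer as a parameter. The helper names are hypothetical. The `tiktoken_encoder` helper relies on OpenAI's open source `tiktoken` package, and the `cl100k_base` encoding is an assumption, since the post does not say which encoding the Tokenizer page used.

```python
from typing import Callable, Sequence


def tokens_per_word(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of tokens per whitespace-separated word in `text`."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(encode(text)) / len(words)


def tiktoken_encoder(encoding_name: str = "cl100k_base"):
    """Return an encode function backed by tiktoken.

    Assumption: `cl100k_base` may not be the encoding used in the
    experiment described above. Requires `pip install tiktoken`.
    """
    import tiktoken  # third-party, imported lazily so the rest runs without it

    return tiktoken.get_encoding(encoding_name).encode


# Example usage (needs tiktoken installed):
#   encode = tiktoken_encoder()
#   print(tokens_per_word("The quick brown fox", encode))
```

Passing the encoder in as an argument keeps the measurement independent of any one tokenizer, so the same function can compare encodings side by side.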

As AI becomes an ever more present component of our lives and work, this token tax poses a significant equity problem disadvantaging the global majority and anyone not using English as their primary language.

By Morten Rand-Hendriksen

Morten Rand-Hendriksen is a Senior Staff Instructor at LinkedIn Learning (formerly Lynda.com) specializing in AI, bleeding edge web technologies, and the intersection between technology and humanity. He also occasionally teaches at Emily Carr University of Art and Design. He is a popular conference and workshop speaker on all things tech ethics, AI, web technologies, and open source.