ChatGPT has made a splash around the world with its ability to generate text that is indistinguishable from human-written text. It’s a GPT-based chatbot that can be used to generate text in a variety of languages, including English, Spanish, French, German, Italian, Russian, Chinese, Japanese, Korean, and more. In this article, we’ll discuss the official ChatGPT Token Calculator and how to use it.
What is ChatGPT?
We’ve discussed in depth what ChatGPT is and how to use it in the article “ChatGPT: your guide to AI Language Models”, so feel free to give it a read! In short, though, ChatGPT is an advanced AI language model developed by OpenAI, based on the GPT-4 architecture.
This model has the ability to understand and generate human-like text, making it incredibly useful for various applications, such as content generation, chatbots, and more. Its proficiency in natural language processing allows it to grasp context, analyze text, and provide relevant, coherent responses.
What is a “token”?
A token is the basic unit of text that a language model reads and writes: a common sequence of characters that may be a whole word, part of a word, or a single character or symbol. Text is split into tokens both when training AI language models such as ChatGPT and when generating text with them. Broadly speaking, the more tokens of text a model is trained on, the more capable it tends to be.
Why are tokens important?
Using ChatGPT programmatically, via OpenAI’s API, allows you to integrate human-like responses directly into your applications in an automated manner. OpenAI charges based on the amount of content submitted to and produced by ChatGPT, measured in tokens. The more tokens you use, the more you pay.
It’s important to get a rough estimate of how many tokens your application will use, so you can estimate the cost of using ChatGPT for you and your applications. This is where the ChatGPT Token Calculator comes in handy.
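As a back-of-the-envelope sketch of such an estimate (the numbers below are placeholders for illustration; OpenAI’s actual per-token rates vary by model and change over time, so check the official pricing page):

```python
def estimate_monthly_cost(tokens_per_request: int, requests_per_month: int,
                          price_per_1k_tokens: float) -> float:
    """Rough cost estimate: OpenAI quotes prices per 1,000 tokens."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1000 * price_per_1k_tokens

# Placeholder numbers for illustration only -- substitute the real
# per-1K-token rate of the model you plan to use.
cost = estimate_monthly_cost(tokens_per_request=500,
                             requests_per_month=10_000,
                             price_per_1k_tokens=0.002)
print(f"Estimated monthly cost: ${cost:.2f}")  # -> $10.00 with these numbers
```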
Where’s the ChatGPT Token Calculator?
OpenAI has an official Token Calculator website which calculates tokens the same way the API does (and, as such, the same way OpenAI uses to bill you for usage). You can find it at https://platform.openai.com/tokenizer.
There are a few caveats, though:
- There is currently no official support for calculating tokens for GPT-4. Only GPT-3 is available through this website.
- It is a website, and it provides no API itself, which means you can’t use it programmatically. There is, however, an official Python package that exposes the same tokenization programmatically, available on GitHub (see the sketch below).
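Assuming the package in question is OpenAI’s tiktoken tokenizer library, a minimal counting sketch looks like this:

```python
# pip install tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens the way the API tokenizes text for the given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(count_tokens("ChatGPT has made a splash around the world!"))
```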
Are tokens the same as words?
The short answer is no. A single word may correspond to one token or be split across several. OpenAI explains this via a rule of thumb:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
Put differently, you can get a rough estimate as follows:
- Take any text
- Count the number of words in it
- Multiply the result by 1.33 to get a rough estimate of the token count (since 1 token ≈ ¾ of a word, 1 word ≈ 1.33 tokens)
Do note that special characters, non-alphanumeric characters, and other symbols may be counted as separate tokens, and sometimes counted very differently, so this is only a rough estimate.
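In code, this heuristic is only a couple of lines (a rough sketch; the 1.33 factor really only holds for typical English prose):

```python
def estimate_token_count(text: str) -> int:
    """Rough estimate: ~1.33 tokens per word (1 token ~= 3/4 of a word)."""
    word_count = len(text.split())
    return round(word_count * 1.33)

print(estimate_token_count("A helpful rule of thumb for common English text."))
```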
As an example, we took the first paragraph of this article and ran it both through the GPT-3 Token Calculator and through the 1.33 rough estimation. Here are the results:
| Method | Number of Words | Token Count |
| --- | --- | --- |
| Token Calculator | 63 | 89 |
| 1.33 estimate | 63 | ~84 |
Looking at the OpenAI Tokenizer’s breakdown of that paragraph, a couple of things might not be obvious but explain the difference in the estimate. For example, the word `ChatGPT` does not count as a single token, but rather as three: `Chat` (an English word), then `G`, then `PT`. We can only guess why `G` and `PT` are not considered a single token together, but it’s likely because `G` is a single letter, and `PT` is commonly used to refer to multiple things in the English language, such as “Pacific Time”, “Physical Therapy”, “Part Time”, and more. These might have shown up during the training of the model, and as such, it’s likely that the model has learned to separate them.
The same is true for our hyphenated word “human-written”, which is counted as 3 separate tokens: `human`, `-`, and `written`. The `we'll` contraction is also counted as two tokens: `we` and `'ll`.
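To reproduce this kind of token-by-token breakdown yourself, a sketch along these lines should work, again assuming tiktoken (the "r50k_base" encoding corresponds to the GPT-3 family; newer encodings such as "cl100k_base" split the same text differently, so exact splits may vary):

```python
import tiktoken

# "r50k_base" is the GPT-3-family encoding; other encodings
# split the same text into different tokens.
encoding = tiktoken.get_encoding("r50k_base")

text = "ChatGPT generates human-written text, and we'll count its tokens."
token_ids = encoding.encode(text)

# Decode each token id individually to see exactly how the text was split.
for token_id in token_ids:
    print(token_id, repr(encoding.decode([token_id])))

print(f"Total: {len(token_ids)} tokens")
```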