by pseudosavant 10 hours ago
I'm really fascinated by the opportunities to analyze videos. The number of tokens a video compresses down to, and what you can reason about across those tokens, is incredible.
The actual token calculations for video input to Gemini 3 Pro are...confusing.
That's because, for non-text input, it isn't actually tokens that get fed into the model. Text is tokenized, and each token maps to a specific embedding vector. For other media, though, they've trained encoders that analyze the media and produce vectors in the same "format" as token embeddings, without there ever being an actual token.
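A toy sketch of the difference (names and dimensions are made up for illustration): text goes through a vocabulary lookup, while media goes through an encoder that projects straight into the same embedding space.

```python
import numpy as np

D = 8  # embedding dimension (tiny, for illustration)
rng = np.random.default_rng(0)

# Text path: each token ID indexes a row of a learned embedding table.
vocab_size = 100
embedding_table = rng.normal(size=(vocab_size, D))

def embed_text(token_ids):
    return embedding_table[token_ids]   # shape: (num_tokens, D)

# Media path: an encoder (a stand-in linear projection here) maps raw
# patches/frames directly to vectors in the SAME D-dimensional space.
# No token IDs and no vocabulary lookup are ever involved.
patch_dim = 16
projection = rng.normal(size=(patch_dim, D))

def embed_media(patches):
    return patches @ projection         # shape: (num_patches, D)

text_vecs = embed_text(np.array([3, 41, 7]))
media_vecs = embed_media(rng.normal(size=(5, patch_dim)))

# Both streams produce identically shaped vectors per position, so the
# model can consume them as one interleaved sequence.
assert text_vecs.shape == (3, D)
assert media_vecs.shape == (5, D)
```

So when a provider reports a video's "token count", it's really counting embedding positions, not tokens in the text sense.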
Most providers have rules for how many tokens a piece of media should "cost", but those rules aren't usually exact.
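The "cost" is typically just a flat bookkeeping formula applied to duration rather than a count of anything the model literally consumes. A sketch with assumed rates (loosely modeled on older published Gemini numbers; Gemini 3 Pro's actual rates may differ):

```python
# All three constants are assumptions for this sketch, not Gemini 3
# Pro's real billing rates.
TOKENS_PER_FRAME = 258        # assumed cost per sampled video frame
FRAMES_PER_SECOND = 1         # assumed frame-sampling rate
AUDIO_TOKENS_PER_SECOND = 32  # assumed cost for the audio track

def video_token_cost(duration_seconds: float) -> int:
    """Flat-rate token accounting for a video of the given length."""
    frames = int(duration_seconds * FRAMES_PER_SECOND)
    video_tokens = frames * TOKENS_PER_FRAME
    audio_tokens = int(duration_seconds * AUDIO_TOKENS_PER_SECOND)
    return video_tokens + audio_tokens

print(video_token_cost(60))  # cost of a one-minute clip under these rates
```

The point is that the number comes from duration times a fixed rate, so it can't exactly reflect what the encoder actually emits for any particular video.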