by Tepix 3 days ago

Interesting. How fast is your service? Do you guarantee a certain number of tokens/s?

sacrelege 3 days ago

We typically observe throughput of around 100–110 toks/s; with larger context sizes it drops to 90–100 toks/s.

While we don't guarantee a fixed toks/s rate, we scale by provisioning external GPU nodes during peak demand. These nodes run our own dockerized environment over a secure tunnel.

Our goal is to ensure a consistent baseline performance of at least 60–80 toks/s, even under high load.
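
For anyone who wants to check numbers like these from the client side, here is a minimal sketch of how toks/s could be measured over a streamed response. It is not our actual tooling: `measure_toks_per_s`, `meets_baseline`, and the 60 toks/s default are hypothetical names and values chosen for illustration, and the stream is assumed to yield one item per generated token.

```python
import time
from typing import Callable, Iterable


def measure_toks_per_s(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Drain a token stream and return the observed tokens per second.

    Assumes each item yielded by `stream` is exactly one token.
    """
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    # Guard against a zero-length interval (e.g. an empty, instant stream).
    return count / elapsed if elapsed > 0 else 0.0


def meets_baseline(toks_per_s: float, baseline: float = 60.0) -> bool:
    """Check a measured rate against an assumed baseline (hypothetical default)."""
    return toks_per_s >= baseline
```

Note that this times the whole stream, so time to first token is included; timing from the first received token instead would give a pure decode rate.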

Tepix 3 days ago

Sounds good. I saw that you use the FP8 version of the model. Do you also quantize the KV cache?

sacrelege 3 days ago

No, I don't, since there seems to be a silent degradation bug.