by mkmccjr 8 hours ago

Thank you for this! On the second point, you are absolutely correct; that was sloppy writing on my part. I will correct that in the post.

I'm not certain I understand your first point. When I add the type hint, it's me asserting the type, not the compiler proving anything. If the value at runtime isn't actually a byte array, I would expect a ClassCastException.
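To make the asserting-vs-proving distinction concrete, here is a rough Java sketch of what a hinted array access boils down to on the JVM (the class and method names are hypothetical, and this is an illustration of the JVM-level behavior, not the actual Clojure compiler output): the cast is attempted at runtime, and a wrong type throws ClassCastException rather than being rejected ahead of time.

```java
public class CastSketch {
    // Roughly the shape of a hinted access: cast the Object,
    // then read the array length directly.
    static int length(Object o) {
        return ((byte[]) o).length;
    }

    public static void main(String[] args) {
        System.out.println(length(new byte[16]));
        try {
            length("not a byte array");
        } catch (ClassCastException e) {
            // The "assertion" fails only here, at runtime.
            System.out.println("ClassCastException, as expected");
        }
    }
}
```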

But I am new to Clojure, and I may well be mistaken about what the compiler is doing.

zahlman 7 hours ago

I mean, I think that probably is what happens. But even though the hinted version is much faster, the generated bytecode presumably still includes instructions to attempt the cast and raise that exception when it fails.

(And the "JIT hotpath optimization" in question could amount to something like the CPU branch-predicting that cast check at the lowest level.)

mkmccjr 7 hours ago

Aha, I think I better understand your point: since the generated bytecode includes a cast, my explanation about the optimization is too simplistic.

I haven't actually inspected the emitted bytecode, so I was only reasoning from the observed speedup.

Your point about branch prediction is really interesting; it would explain how the cast becomes almost free once the type is stable in the hot path.

I'm learning a lot from this thread -- thank you for pushing on the details!

EdNutting 2 hours ago

Without seeing the actual differences in the bytecode it will be hard to tell what’s really going on. From my experience with other JITs, I’d expect the situation to be something like:

A) Without the typecast, the compiler can’t prove anything about the type, so it has to assume a fully general type. This creates a very “hard” bytecode sequence in the middle of the hotpath which can’t be inlined or optimised.

B) With the typecast, the compiler can assume the type, and thus only needs to emit type guards (as suggested in this thread). However, I'd expect those type guards to be lifted through the function as far as possible - ideally the JIT lifts them all the way out of the loop, so they're checked once rather than on every iteration. That enables a much shorter sequence for getting the array length each time around the loop, and avoids repeating type/class checks.

This would avoid pressuring the branch predictor.
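The hoisting described in (B) can be pictured, in Java terms, as the difference between the two hand-written shapes below (hypothetical methods for illustration; the JIT performs the equivalent transformation on the compiled code, not on the source):

```java
public class GuardHoist {
    // Naive shape: conceptually re-does the type check on every iteration.
    static long sumChecked(Object o) {
        long total = 0;
        for (int i = 0; i < ((byte[]) o).length; i++) {  // cast inside the loop
            total += ((byte[]) o)[i];
        }
        return total;
    }

    // Hoisted shape: one guard up front, then a tight loop over a known type.
    static long sumHoisted(Object o) {
        byte[] a = (byte[]) o;  // single type guard, lifted out of the loop
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        System.out.println(sumChecked(data));
        System.out.println(sumHoisted(data));
    }
}
```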

Most JITs have thresholds for “effort” that depend on environment and on how hot a path is measured to be at runtime. The hotter the path, the more effort the JIT will apply to optimising it (usually also expanding the scope of what it tries to optimise). But again, without seeing the assembly code (not just bytecode) of what the three different scenarios produce (unoptimised, optimised-in-test, optimised-in-prod) it would be hard to truly know what’s going on.

At best we can just speculate from experience of what these kinds of compilers do.

imtringued 30 minutes ago

It's speculation because the author didn't show the bytecode, or even just what the code decompiles to in Java.

But even with speculation, it shouldn't be that surprising that dynamic dispatch and reflection [0] are quite expensive compared to a cast and an array-length read.
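The two shapes being compared can be sketched like this (a hypothetical illustration, not a benchmark; the reflective path goes through java.lang.reflect.Array.getLength, whose intrinsification is what JDK-8051447 below tracks):

```java
import java.lang.reflect.Array;

public class ReflectVsCast {
    // Reflective path: a general entry point that must dispatch
    // on the runtime class of its argument.
    static int lengthReflective(Object o) {
        return Array.getLength(o);
    }

    // Direct path: one checkcast plus an arraylength instruction.
    static int lengthCast(Object o) {
        return ((byte[]) o).length;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        System.out.println(lengthReflective(data));
        System.out.println(lengthCast(data));
    }
}
```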

[0] https://bugs.openjdk.org/browse/JDK-8051447