My favorite thing about floating point numbers: you can divide by zero. The result of x/0.0 is +/- inf (or NaN if x is zero). There's a helpful table in "weird floats" [0] that covers all the cases for division and a bunch of other arithmetic instructions.
This is especially useful when writing branchless or SIMD code. Adding a branch for checking against zero can have bad performance implications and it isn't even necessary in many cases.
Especially in graphics code I often see a zero check before a division and a fallback for the zero case. This often practically means "wait until numerical precision artifacts arise and then do something else". Often you could just choose the better of the two options you have instead of checking for zero.
Case in point: choosing the axes of your shadow map projection matrix. You have two options (world x axis or z axis), choose the better one (larger angle with viewing direction). Don't wait until the division goes to inf and then fall back to the other.
[0] https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...
> you can divide by zero
It is implementation-dependent. An implementation is not obliged to respect IEEE 754.
Floating-point arithmetic adopted this from the extended reals (usually denoted ℝ̅): https://en.wikipedia.org/wiki/Extended_real_number_line (see Arithmetic operations there)
This is also my favorite thing about floating point numbers. Unfortunately languages like Python try to be smart and prevent me from doing it. Compare:
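The comparison presumably looked something like the following minimal sketch (numpy is used here only as a convenient way to get IEEE 754 scalar behaviour; it is an illustration, not the original snippet):

```python
import numpy as np

# Plain Python refuses to follow IEEE 754 here and raises instead:
try:
    1.0 / 0.0
except ZeroDivisionError as e:
    print("python:", e)        # python: float division by zero

# IEEE 754 semantics, e.g. via numpy scalars:
with np.errstate(divide='ignore'):
    print("ieee754:", np.float64(1.0) / np.float64(0.0))   # ieee754: inf
```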
I'm so used to writing such zero divisions in other languages like C/C++ that this Python quirk still trips me up.
Division by zero is an error and it should be treated as such. "Infinity" is an error indication from overflow and division by zero, nothing more.
This is totally false mathematically. Please look up the extended real number system for an example. Many branches of mathematics affix infinity to some existing number system, extending its operations consistently, and do all kinds of useful things with this setup. Being able to work with infinity in exactly the same way in IEEE754 is crucial for being able to cleanly map algorithms from these domains onto a computer. If dividing by zero were an error in floating point arithmetic, I would be unable to do my job developing numerical methods.
The article's point 3 says that this is a myth. Indeed, the _limit_ of `1/x` as `x` approaches zero from the right is positive infinity. What's more, division by _negative zero_ (which, perhaps surprisingly, is a thing) yields negative infinity, which is also the value of the corresponding limit. If you divide a finite float by infinity, you get zero, because `lim_{x\to\infty} c/x=0`. In many cases you can treat division by zero or infinity as the appropriate limit.
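A quick check of those limit-like cases (again leaning on numpy, since plain Python raises on float division by zero; this is an illustration only):

```python
import numpy as np

one = np.float64(1.0)
with np.errstate(divide='ignore'):
    print(one / np.float64(-0.0))          # -inf, matching the limit from the left
print(one / np.inf)                        # 0.0, matching lim_{x->inf} 1/x
print(np.float64(-0.0) == np.float64(0.0)) # True: negative zero still compares equal to zero
```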
I am allowed to disagree with the article.
Sure, but it makes sense, doesn't it? Even `inf-inf == NaN` and `inf/inf == NaN`, which is true in calculus: limits like these are undefined, unless you use l'Hôpital's rule or something. (I know NaN isn't equal to itself, it's just for illustration purposes) But then again, you usually don't want these popping up in your code.
In practice, though, I can't recall any HPC codes that want to use IEEE-754 infinities as valid data.
You still can't divide by zero, it just doesn't result in an error state that stops execution. The inf and NaN values are sentinel values that you still have to check for after the calculation to know if it went awry.
In the space of floats, you are dividing by zero. To map back to the space of numbers you have to check. It's nice, though; inf and NaN sentinels give you the behavior of a monadic `Result | Error` pipeline without having to wrap your numbers in another abstraction.
If dividing by zero has a well-defined result that doesn't abort execution, what exactly does "can't" even mean?
Operations on those sentinel values are also defined. This can affect when checking needs to be done in optimized code.
I believe divide-by-zero produces an exception. The machine can either be configured to mask that exception, or not.
Personally, I am lazy, so I don’t check the mxcsr register before I start running my programs. Maybe gcc does something by default, I don’t know. IMO legitimate division by zero is rare but not impossible, so if you do it, the onus is on you to make sure the flags are set up right.
Correct, divide by zero is one of the original five defined IEEE754-1985 exceptions. But the default behavior then and now is to produce that defined result mentioned and continue execution with a flag set ("default non-stop"). Further, conforming implementations also allow "raiseNoFlag".
It's well-defined is all that really matters AFAIC.
This is the Result monad in practice. It allows you to postpone error handling until the computation is done.
It has always amused me that integer division by 0 results in a "floating point exception", but floating point division by 0.0 doesn't!
You can divide by zero, but you mayn't.
You can exploit the exactness of (specific) floating point operations in test data by using sums of powers of 2. Polynomials with such coefficients produce exact results so long as the overall powers are within ~53 powers of 2 (don't quote me exactly on that, I generally don't push the range very high!). You can find exact polynomial solutions to linear PDEs with such powers using high enough order finite difference methods for example.
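A small sanity check of that idea (the polynomial and the evaluation point below are made up for illustration; `fractions.Fraction` supplies the exact reference value):

```python
from fractions import Fraction

# Coefficients and the evaluation point are all sums of powers of two.
coeffs = [2.0**-3, 2.0**1, 2.0**-5]      # 0.125*x^2 + 2*x + 0.03125
x = 2.0**4 + 2.0**-2                      # 16.25

def horner(cs, v):
    acc = 0.0
    for c in cs:
        acc = acc * v + c
    return acc

exact = sum(Fraction(c) * Fraction(x) ** (len(coeffs) - 1 - i)
            for i, c in enumerate(coeffs))
print(horner(coeffs, x) == float(exact))  # True: every intermediate result was exact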
However, the story about non-determinism is no myth. The intel processors have a separate math coprocessor that supports 80bit floats (https://en.wikipedia.org/wiki/Extended_precision#x86_extende...). Moving a float from a register in this coprocessor to memory truncates the float. Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers. Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed. The non-determinism isn't intrinsic to the floating point arithmetic but to the non-determinism of when this truncation may occur. This is more relevant for fields where chaotic dynamics occur. So the same program with the same inputs can produce different results.
NaN is an error. If you take the square root of a negative number you get a NaN. This is just a type error; use complex numbers to overcome this one. But then you get 0./0., which is a NaN, or Inf - Inf, and a whole slew of other things that produce out-of-bounds results. Whether it is expected or not is another story, but it does mean that you are unable to represent the value with a float, and that is a type error.
> Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed
That's ridiculous. No OS in its right mind would flush FPU regs to 64 bits only, because that would break many things, most obviously "real" 80-bit FP, which is still a thing and the only reason x87 instructions still work. It would even break plain equality comparisons, making all FP useless.
For 64 bit FP most compilers prefer SSE rather than x87 instructions these days.
https://bugs.php.net/bug.php?id=53632
Never, for sure.
NaN is not necessarily an error. It might be fine. It depends on what you're doing with it.
If NaN is invalid input for the next step, then sure why not treat it as an error? But that's a design decision not an imperative that everybody must follow. (I picture Mel Brooks' 15 commandments, #11-15, that fell and broke. This is not like that.)
Sure, not every runtime type error needs to panic your application, nor even panic a request handler, nor even result in a failed handling. That doesn't mean you didn't encounter an error though. Error doesn't mean fatal.
The design decision isn't, "Was this a type error?" but, "What do I need to do about this type error?"
Yes it is totally fine. I've seen code basically treating NaN as missing data as opposed to std::optional<double> or equivalent in your language. NaN propagates so this works like using such types as a monad.
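A trivial sketch of that pattern in plain Python (the data here is made up for illustration):

```python
import math

readings = [1.5, float('nan'), 2.5]   # nan stands in for a missing sample
total = sum(readings)                 # the nan propagates through the sum
print(math.isnan(total))              # True: downstream code can detect "missing"
print(sum(r for r in readings if not math.isnan(r)))  # 4.0 once it is filtered out
```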
Wow, you're crossing a few wires in your zeal to provide information to the point that you're repeating myths.
> The intel processors have a separate math coprocessor that supports 80bit floats
x86 processors have two FPU units, the x87 unit (the one you're describing) and the SSE unit. Anyone compiling for x86-64 uses the SSE unit by default, and most x86-32 compilers still default to SSE anyway.
> Moving a float from a register in this coprocessor to memory truncates the float.
No it doesn't. The x87 unit has load and store instructions for 32-bit, 64-bit, and 80-bit floats. If you want to spill 80-bit values as 80-bit values, you can do so.
> Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers.
Hot loops these days use the SSE stuff because they're so much faster than x87. Friends don't let friends use long double without good reason!
> Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed.
Lol, nope. You'll spill the x87 register stack on thread context switch with FSAVE or FXSAVE or XSAVE, all of which will store the registers as 80-bit values without loss of precision.
That said, there was a problem with programs that use the x87 unit, but it has absolutely nothing to do with what you're describing. The x87 unit doesn't have arithmetic for 32-bit and 64-bit values, only 80-bit values. Many compilers, though, just pretended that the x87 unit supported arithmetic on 32-bit and 64-bit values, so that FADD would simultaneously be a 32-bit addition, a 64-bit addition, and an 80-bit addition. If the compiler needed to spill a floating-point register, it would spill the value as a 32-bit value (if float) or a 64-bit value (if double), and register spills are pretty unpredictable for user code. That's the nondeterminism you're referring to, and it's considered a bug in every compiler I'm aware of. (See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37... for a more thorough description of the problem).
It isn't zeal, it's a hazy, 15-year-old memory of getting different results on different executions on the same supercomputer. The story that went around was the one I relayed, but your link certainly does a better job explaining the things that can happen, in the user-perspective section:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...
Summarized as,
> Most users cannot be expected to know all of the ways that their floating-point code is not reproducible.
Glad to know that the situation is defaulting to SSE2 nowadays though.
> Non-determinism occurs in programs running on intel
FTFY. They even changed some of the more obscure handling between 8087, 80287, 80387. So much hoop jumping if you cared about binary reproducibility.
Seems to be largely fixed with targeting SSE even for scalar code now.
I like these. They are push back against the sort of… first correction that people make when encountering floating point weirdness. That is, the first mistake we make is to treat floats as reals, and then we observe some odd rounding behavior. The second mistake we make is to treat the rounding events as random. A nice thing about IEEE floats is that the rounding behavior is well defined.
Often it doesn't matter; you ask for a GEMM and you get whatever order of operations BLAS, AVX-whatever, and OpenMP conspire to give you, so it is more-or-less random.
But if it does matter, the ability to define it is there.
On the other hand, it feels wrong to me to call these "myths" when they are really just simplifications.
(And I have heard of the 80-bit internal register thing described in https://news.ycombinator.com/item?id=44888692 causing real problems for people before. And -ffast-math is basically spooky action at a distance considering how it bleeds into the entire program; see e.g. https://moyix.blogspot.com/2022/09/someones-been-messing-wit....)
> A nice thing about IEEE floats is that the rounding behavior is well defined.
Until it isn't. I used to play the CodeWeavers port of Kohan and while the game would allow you to do crossplay between Windows and Linux, differences in how the two OSes rounded floats would cause the game to desynchronize after 15-20 minutes of play or so. Some unit's pathfinding algorithm would zig on Windows and zag on Linux, causing the state to diverge and end up kicking off one of the players.
Is it possible that your different operating systems just had different mxcsr values?
Or, since it was a port, maybe they were compiled with different optimizations.
There are a lot of things happening under the hood but most of them should be deterministic.
until someone compiles with -ffast-math enabled, stating "I don't care about accuracy, as long as it's fast".
It is good to enable that flag because it also enables the “fun safe math optimizations” flag, and it is important to remind people that math is a safe way to have fun.
"Friends don't let friends use fast-math"
https://simonbyrne.github.io/notes/fastmath/
The differences are almost certainly not in how the two OSes rounded floats--the IEEE rounding modes are standard, and almost no one actually bothers to even change the rounding mode from the default.
For cross-OS issues, the most likely culprit is that Windows and Linux are using different libm implementations, which means that the results of functions like sin or atan2 are going to be slightly different.
The problem with rounding modes and other fpenv flags is that any library anywhere might flip some flag and suddenly the whole program changes behavior.
When writing code, it's useful to follow simpler rules than strictly needed to make it easier to read for other coders (including you in the future). For example, you might add some unnecessary brackets to expressions to avoid remembering all the operator precedence rules.
The article is right that sometimes you can use simple equality on floating point values. But if you have an (almost) blanket rule to use fuzzy comparison, or (even better) just avoid any sort of equality-like comparison altogether, then your code might be simpler to understand and you might be less likely to make a mistake about when you really can safely do that.
It's still sensible to understand that it's possible though. Just that you should try to avoid writing tricky code that depends on it.
Every time I use exact floating point equality I simply write a comment explaining why I did this. That's what comments are for.
That sort of argument can be problematic though, because the code can become misleading. If I see a fuzzy comparison then I'd assume it is there because it is needed, which might make the rest of the code more difficult to understand/modify because I then have to assume that everywhere else the values might be fuzzy.
Note: the improved loop in section "1. They are not exact" can easily hang. If count > 2^24, then the ulp of f is 2, and adding 1.0f leaves f unchanged. What's wild is that a few lines later he notes how above 2^24 the numbers start "jumping" every 2. OK, but FP values are always "jumping" by their ulp, regardless of how big or small they are.
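The hang condition can be seen without writing the loop itself; here single precision is emulated by round-tripping through `struct` (an illustration, not the article's code):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

f = f32(2.0**24)            # 16777216.0, ulp is now 2
print(f32(f + 1.0) == f)    # True: adding 1.0f leaves f unchanged, so the loop never advances
```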
I've never heard anyone say any of these supposed myths, except for the first one, sort of, but nobody means what the first one pretends it means, so this whole post feels like a big strawman to me.
Just to add some context Adam works at AMD as a dev tech, so he is constantly working with game studios developers directly. While I can't say I've heard the same things since I'm somewhere else, I have seen some of the assumptions made in some shader code and they do line up with the kind of things he's saying.
People do say a lot of nonsense about floats, my younger self trying to bash JavaScript included. They e.g. do fine as integers - up to 2**53; this could be optimised by a JIT to use actual integer math.
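For example, with plain Python doubles:

```python
print(2.0**53 - 1.0 == 2**53 - 1)   # True: every integer below 2^53 is representable exactly
print(2.0**53 + 1.0 == 2.0**53)     # True: 2^53 + 1 is the first integer that gets lost
```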
I was putting together some technical interview questions for our candidates and I wanted to see what ChatGPT would put for an answer. It told me floating point numbers were non-deterministic and wouldn’t back down from that answer so it’s getting it from somewhere.
I concur, it seems low effort, and the only real common "myth" (the 1st one) is not really disproven. In fact, the very example he gives goes to prove it, as it is going to become an infinite loop given a large enough N....
Also, compiler optimizations should not affect the result, especially without "fast-math/O3" (which arguably many people stupidly use nowadays, then complain).
> Also, compiler optimizations should not affect the result, especially without "fast-math/O3" (which arguably many people stupidly use nowadays, then complain).
-ffp-contract is an annoying exception to that principle.
The first rule is true, but relying on it is dangerous unless you are well versed in which values can and cannot be represented exactly as floats. It's best to pretend it isn't true in most cases, to avoid footguns.
Same, but I still learned a fair amount anyways. I say no harm, no foul.
People need to understand rounding better, especially the topic of when rounding can happen and when it can't for the basic operations.
Updating with a concrete example: The Fortran standard defines the real MOD and MODULO intrinsic functions as being equivalent to the straightforward sequence of a division, conversion to integer, multiplication, and subtraction. This formula can round, obviously. But MOD can be (and thus should be) implemented exactly by other means, and most Fortran compilers do so instead. This leaves Fortran implementors in a bit of a pickle -- conform to the standard, or produce good results?
Definitely. People think they'll get out of knowing how rounding works by using arbitrary precision arithmetic, but arguably it's even more important there (you run out of precision/memory at some point; what do you think happens then?). You can use floats for money if you do the rounding right.
A famous old interview question regarding floating points: https://stackoverflow.com/questions/6699066/in-which-order-s...
This awesome article drives home the point very succinctly: https://fabiensanglard.net/floating_point_visually_explained...
Related: What Every Computer Scientist Should Know About Floating-Point Arithmetic <https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...>
Also related:
https://floating-point-gui.de/
My list includes: https://0.30000000000000004.com/
My favorite trick: NaN boxing. NaNs aren't just for errors, but also for smuggling other data inside. For a double, you have a whopping 53 bits of payload, enough to cram in a pointer and maybe a type tag, and many JavaScript engines do (since JS numbers are doubles, after all)
https://wingolog.org/archives/2011/05/18/value-representatio...
https://piotrduperas.com/posts/nan-boxing
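A rough sketch of the trick in Python, poking at the bit pattern with `struct` (the field layout and the 51-bit payload limit are my own simplification, not taken from the linked engines; nothing but pack/unpack touches the value, so the payload is preserved):

```python
import struct

QUIET_NAN = 0x7FF8_0000_0000_0000      # exponent all ones + quiet bit set

def box(payload):
    """Hide an integer payload (< 2^51) inside a quiet NaN."""
    assert 0 <= payload < (1 << 51)
    return struct.unpack('<d', struct.pack('<Q', QUIET_NAN | payload))[0]

def unbox(x):
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return bits & ((1 << 51) - 1)

v = box(0xDEAD_BEEF)
print(v != v)            # True: it really is a NaN as far as arithmetic is concerned
print(hex(unbox(v)))     # 0xdeadbeef: the payload survives the round trip
```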
It is also how RISC-V floating point registers are required to store floats of smaller widths. Eg if your CPU supports 64-bit floats (D extension), its FPU registers will be 64-bit wide. If you use an instruction to load a 16-bit float (Zfh extension) into such a register, it will be boxed into a negative quiet NaN with all bits above the lower 16 bits set to 1.
52 bits of payload, and at least one bit must be set.
You can put stuff into the sign bit too, that makes 53. Yeah, the lower 52 bits can't all be zero - that'd be ±INF, but the other 2^53-2 values are all yours to use.
It's possible for the sign bit of a NaN to be changed by a "non-arithmetic" operation that doesn't trap on the NaN, so don't put anything precious in there.
This is a great article!
A few more resources on this topic:
https://randomascii.wordpress.com/category/floating-point/ - a series of articles about floating point, with deeper details.
https://float.exposed/ - explorer of floating point values by Bartosz Ciechanowski.
Fun facts about floating point, from my practice:
Obvious: An array containing NaNs will almost certainly be sorted incorrectly with std::sort.
Non-obvious: sorting an array containing NaNs with std::sort from libstdc++ (gcc's implementation) could lead to a buffer overflow.
Non-obvious: using MMX instructions without clearing the FPU state leads to incorrect results of floating-point instructions in other parts of the code base: https://github.com/simdjson/simdjson/issues/169
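The "obvious" item is easy to reproduce even outside C++; a quick Python illustration (the exact output depends on where the NaN sits, but it will generally not be in ascending order):

```python
data = [3.0, float('nan'), 1.0, 2.0]
print(sorted(data))   # e.g. [3.0, nan, 1.0, 2.0] -- every comparison involving nan is False,
                      # so the sort's ordering assumptions silently break
```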
> They are not exact
It's not exactly a myth; as the article mentions, they're only exact for certain ranges of values.
> NaN and INF are indication of an error
This is somewhat semantic, but dividing by zero typically does create a hardware exception. However, it's all handled behind the scenes, and you get "Inf" as the result.
You can make it so dividing by zero is explicitly an error; see the "-ftrapping-math" flag.
I think the myth about exactness is that you can't use strict equality with floating point numbers because they are somehow fuzzy. They are not.
Some operations involve rounding, notably conversion from decimal, but the (rounded) result is an exact number that can be stored and recovered without loss of precision and equality will work as expected. Converting from floating point binary to decimal is exact though, given enough digits (can be >1000 for very small or very large numbers).
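Concretely, in plain Python:

```python
x = 0.1                       # the decimal-to-binary conversion rounds exactly once
print(float(repr(x)) == x)    # True: the shortest round-tripping decimal recovers x exactly
print(format(x, '.55f'))      # 0.1000000000000000055511151231257827021181583404541015625
                              # the stored binary value, written out exactly in decimal
```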
Archive of the defunct tweet link: https://web.archive.org/web/20210608061416/https://twitter.c...
Here's one that's not a myth: IEEE-754 floats are the only "primitive types" that allow "a == a" to not be true.
I.e., two floats that are _identical_ to each other (even when it's _the same_ variable, at the same memory address) can compare as not _equal_ to each other, specifically if it's NaN. This is dictated by IEEE-754, and this is true for all programming languages I know, and to this day this makes zero sense to me, but apparently it is useful for some reason.
>but apparently this is useful for some reason.
It lets you easily test whether a value is NaN without needing a library function call. (Even if the language wanted to provide a NaN literal, there's more than one NaN value, so that wouldn't actually work!)
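That is exactly what a no-dependency NaN check looks like:

```python
def is_nan(x):
    return x != x            # only NaN compares unequal to itself

print(is_nan(float('nan')))  # True
print(is_nan(1.0))           # False
```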
>>> A beginner would use them, trusting they are infinitely capable and precise, which can lead to problems. An intermediate programmer knows that they have some limitations, and so by using some good practices the problems can be avoided.
Amusingly, pocket calculators became affordable while I was in high school, and anybody who was interested in math learned the foibles of floating point almost instantly. Now it's an advanced topic. A difference may have been that our calculators had very few significant digits, so the errors introduced by successive calculations became noticeable more quickly. Also, you had to think about what you were doing because, even if you trusted the calculator, you didn't trust your fingers.
Modern pocket calculators now use some very fancy math that makes them have much less error than if they used floating point. Many of them will use arbitrary-precision arithmetic at minimum, and there are some fancier schemes.
In the context of double precision the article says
> the largest integer value that can be represented exactly is 2^53
— I am confused as to why it is not 2^52, given that there are 52 bits of mantissa, so the relative accuracy is 2^-52, which translates to absolute accuracy larger than 1 after 2^52. Compare this to the table there saying "Next value after 1 = 1 + 2^-52".
There's an implied one bit, so you actually have a 53-bit significand (and 53-bit precision) given only a 52-bit mantissa.
Right, I did realize after posting that close to numbers of the form
1{hidden bit} + (1-2^-52){mantissa with all ones}
the relative accuracy — corresponding to the absolute accuracy of a single bit in the mantissa — is about 2^-53. The hidden bit is easy to forget about...
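The spacing can be checked directly with `math.ulp` (Python 3.9+), which makes the hidden-bit bookkeeping concrete:

```python
import math

print(math.ulp(1.0) == 2.0**-52)   # spacing just above 1.0, as in the article's table
print(math.ulp(2.0**52) == 1.0)    # integers between 2^52 and 2^53 are still 1 apart
print(math.ulp(2.0**53) == 2.0)    # above 2^53 the spacing doubles, so 2^53 is the last safe integer
```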
https://stackoverflow.com/questions/1848700/biggest-integer-...
The core thing to know about floating point numbers in comp langs is they aren't floating point numbers.
They are approximations of floating point numbers, and even if your current approximation of a floating point number to represent a value seems to be accurate as an integer value...
There is no guarantee if you take two of those floating point number approximations that appear completely accurate that the resulting operation between them will also be completely accurate.
That's not a useful way to think. Floating point numbers are just floating point numbers. You aren't approximating floating point numbers. You are approximating real numbers.
Floating point numbers are named by contrast with fixed point numbers, where the point (between the integer part and the fractional part) is fixed. That is why they are called floating. The nature of the real numbers is such that they can in general only be approximated, regardless of whether you use fixed point numbers, floating point numbers, or a fancier computable number representation.
The real numbers can be "represented" and "computed with" without needing approximations. The result is that there are no rounding errors. But it's not popular, and not necessarily worth the tradeoffs. On the other hand, see the Android calculator.
The Android calculator doesn't support all real numbers. It supports a specific subset of computable numbers that are computable using the operations presented in the calculator app. It's a good domain-specific number representation. It's far from real numbers.
We can't even represent all the natural numbers: that would require infinite memory. Now you want to represent the real numbers, equinumerous with the power set of natural numbers?
> We can't even represent all the natural numbers
We can represent all real numbers to the same extent that we can represent all natural numbers. They're just infinite strings, with each additional letter in the string mattering less and less as it goes along. The "Type Two Effectivity" model for modelling computations with infinite strings doesn't limit you to using just the computable real numbers. An uncomputable real number can be produced by outputting an infinite string letter by letter with each letter getting picked at random. The TTE model does OTOH limit you to computing only with the computable functions on those real numbers.
The TTE model basically uses (Python-like) generators to output an endless sequence of interval approximations to a real number.
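A toy version of that idea in Python, using a generator that narrows a rational interval around sqrt(2) by bisection (my own illustration of the TTE-style presentation, not a faithful implementation of it):

```python
from fractions import Fraction

def sqrt2_intervals():
    """Yield ever-tighter rational intervals [lo, hi] that contain sqrt(2)."""
    lo, hi = Fraction(1), Fraction(2)
    while True:
        yield lo, hi
        mid = (lo + hi) / 2
        if mid * mid < 2:
            lo = mid
        else:
            hi = mid

gen = sqrt2_intervals()
for _ in range(30):
    lo, hi = next(gen)
print(float(lo), float(hi), float(hi - lo))   # an interval of width 2^-29 around sqrt(2)
```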
Georg Cantor would like a word.
Representable numbers are countable; they have measure zero in the reals.
> An uncomputable real number can be produced by outputting an infinite string letter by letter with each letter getting picked at random.
This certainly outputs an uncomputable real number, given that you have true random input and not a PRNG.
But random numbers from a seed and an algorithm are by definition computable.
And if you can't compute the number, you have no basis for the claim that it's the right number.
Also, good luck computing with arbitrarily long, computed-on-demand digit strings. Consider: how do you know how many digits you need from the input in order to be sure of N correct digits in the output? This might be possible in general, but it certainly doesn't sound like my idea of fun.