My favorite thing about floating point numbers: you can divide by zero. The result of x/0.0 is +/- inf (or NaN if x is zero). There's a helpful table in "weird floats" [0] that covers all the cases for division and a bunch of other arithmetic instructions.
This is especially useful when writing branchless or SIMD code. Adding a branch for checking against zero can have bad performance implications and it isn't even necessary in many cases.
Especially in graphics code I often see a zero check before a division and a fallback for the zero case. This often practically means "wait until numerical precision artifacts arise and then do something else". Often you could just choose the better of the two options you have instead of checking for zero.
Case in point: choosing the axes of your shadow map projection matrix. You have two options (world x axis or z axis), choose the better one (larger angle with viewing direction). Don't wait until the division goes to inf and then fall back to the other.
[0] https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...
> you can divide by zero
It is implementation-dependent. An implementation is not obliged to respect IEEE 754.
Floating-point arithmetic adopted this from the extended reals (usually denoted ℝ̅): https://en.wikipedia.org/wiki/Extended_real_number_line (see Arithmetic operations there)
This is also my favorite thing about floating point numbers. Unfortunately languages like Python try to be smart and prevent me from doing it. Compare:
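The comparison presumably looked something like the following minimal sketch (numpy is used here only as a convenient way to get IEEE 754 scalar behaviour; it is an illustration, not the original snippet):

```python
import numpy as np

# Plain Python refuses to follow IEEE 754 here and raises instead:
try:
    1.0 / 0.0
except ZeroDivisionError as e:
    print("python:", e)        # python: float division by zero

# IEEE 754 semantics, e.g. via numpy scalars:
with np.errstate(divide='ignore'):
    print("ieee754:", np.float64(1.0) / np.float64(0.0))   # ieee754: inf
```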
I'm so used to writing such zero divisions in other languages like C/C++ that this Python quirk still trips me up.
Division by zero is an error and it should be treated as such. "Infinity" is an error indication from overflow and division by zero, nothing more.
This is totally false mathematically. Please look up the extended real number system for an example. Many branches of mathematics affix infinity to some existing number system, extending its operations consistently, and do all kinds of useful things with this setup. Being able to work with infinity in exactly the same way in IEEE754 is crucial for being able to cleanly map algorithms from these domains onto a computer. If dividing by zero were an error in floating point arithmetic, I would be unable to do my job developing numerical methods.
The article's point 3 says that this is a myth. Indeed, the _limit_ of `1/x` as `x` approaches zero from the right is positive infinity. What's more, division by _negative zero_ (which, perhaps surprisingly, is a thing) yields negative infinity, which is also the value of the corresponding limit. If you divide a finite float by infinity, you get zero, because `lim_{x\to\infty} c/x=0`. In many cases you can treat division by zero or infinity as the appropriate limit.
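A quick check of those limit-like cases (again leaning on numpy, since plain Python raises on float division by zero; this is an illustration only):

```python
import numpy as np

one = np.float64(1.0)
with np.errstate(divide='ignore'):
    print(one / np.float64(-0.0))          # -inf, matching the limit from the left
print(one / np.inf)                        # 0.0, matching lim_{x->inf} 1/x
print(np.float64(-0.0) == np.float64(0.0)) # True: negative zero still compares equal to zero
```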
I am allowed to disagree with the article.
Sure, but it makes sense, doesn't it? Even `inf-inf == NaN` and `inf/inf == NaN`, which is true in calculus: limits like these are undefined, unless you use l'Hôpital's rule or something. (I know NaN isn't equal to itself, it's just for illustration purposes) But then again, you usually don't want these popping up in your code.
In practice, though, I can't recall any HPC codes that want to use IEEE-754 infinities as valid data.
You still can't divide by zero, it just doesn't result in an error state that stops execution. The inf and NaN values are sentinel values that you still have to check for after the calculation to know if it went awry.
In the space of floats, you are dividing by zero. To map back to the space of numbers you have to check. It's nice, though; inf and NaN sentinels give you the behavior of a monadic `Result | Error` pipeline without having to wrap your numbers in another abstraction.
If dividing by zero has a well-defined result that doesn't abort execution, what exactly does "can't" even mean?
Operations on those sentinel values are also defined. This can affect when checking needs to be done in optimized code.
I believe divide-by-zero produces an exception. The machine can either be configured to mask that exception, or not.
Personally, I am lazy, so I don’t check the mxcsr register before I start running my programs. Maybe gcc does something by default, I don’t know. IMO legitimate division by zero is rare but not impossible, so if you do it, the onus is on you to make sure the flags are set up right.
Correct, divide by zero is one of the original five defined IEEE754-1985 exceptions. But the default behavior then and now is to produce that defined result mentioned and continue execution with a flag set ("default non-stop"). Further, conforming implementations also allow "raiseNoFlag".
It's well-defined is all that really matters AFAIC.
This is the Result monad in practice. It allows you to postpone error handling until the computation is done.
It has always amused me that integer division by 0 results in a "floating point exception", but floating point division by 0.0 doesn't!
You can divide by zero, but you mayn't.
You can exploit the exactness of (specific) floating point operations in test data by using sums of powers of 2. Polynomials with such coefficients produce exact results so long as the overall powers are within ~53 powers of 2 (don't quote me exactly on that, I generally don't push the range very high!). You can find exact polynomial solutions to linear PDEs with such powers using high enough order finite difference methods for example.
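A small sanity check of that idea (the polynomial and the evaluation point below are made up for illustration; `fractions.Fraction` supplies the exact reference value):

```python
from fractions import Fraction

# Coefficients and the evaluation point are all sums of powers of two.
coeffs = [2.0**-3, 2.0**1, 2.0**-5]      # 0.125*x^2 + 2*x + 0.03125
x = 2.0**4 + 2.0**-2                      # 16.25

def horner(cs, v):
    acc = 0.0
    for c in cs:
        acc = acc * v + c
    return acc

exact = sum(Fraction(c) * Fraction(x) ** (len(coeffs) - 1 - i)
            for i, c in enumerate(coeffs))
print(horner(coeffs, x) == float(exact))  # True: every intermediate result was exact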
However, the story about non-determinism is no myth. The intel processors have a separate math coprocessor that supports 80bit floats (https://en.wikipedia.org/wiki/Extended_precision#x86_extende...). Moving a float from a register in this coprocessor to memory truncates the float. Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers. Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed. The non-determinism isn't intrinsic to the floating point arithmetic but to the non-determinism of when this truncation may occur. This is more relevant for fields where chaotic dynamics occur. So the same program with the same inputs can produce different results.
NaN is an error. If you take the square root of a negative number you get a NaN. This is just a type error; use complex numbers to overcome this one. But then you get 0./0., which is a NaN, or Inf - Inf, and a whole slew of other things that produce out-of-bounds results. Whether it is expected or not is another story, but it does mean that you are unable to represent the value with a float, and that is a type error.
> Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed
That's ridiculous. No OS in its right mind would flush FPU regs to 64 bits only, because that would break many things, most obviously "real" 80-bit FP, which is still a thing and the only reason x87 instructions still work. It would even break plain equality comparisons, making all FP useless.
For 64 bit FP most compilers prefer SSE rather than x87 instructions these days.
https://bugs.php.net/bug.php?id=53632
Never, for sure.
NaN is not necessarily an error. It might be fine. It depends on what you're doing with it.
If NaN is invalid input for the next step, then sure why not treat it as an error? But that's a design decision not an imperative that everybody must follow. (I picture Mel Brooks' 15 commandments, #11-15, that fell and broke. This is not like that.)
Sure, not every runtime type error needs to panic your application, nor even panic a request handler, nor even result in a failed handling. That doesn't mean you didn't encounter an error though. Error doesn't mean fatal.
The design decision isn't, "Was this a type error?" but, "What do I need to do about this type error?"
Yes it is totally fine. I've seen code basically treating NaN as missing data as opposed to std::optional<double> or equivalent in your language. NaN propagates so this works like using such types as a monad.
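A trivial sketch of that pattern in plain Python (the data here is made up for illustration):

```python
import math

readings = [1.5, float('nan'), 2.5]   # nan stands in for a missing sample
total = sum(readings)                 # the nan propagates through the sum
print(math.isnan(total))              # True: downstream code can detect "missing"
print(sum(r for r in readings if not math.isnan(r)))  # 4.0 once it is filtered out
```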
Wow, you're crossing a few wires in your zeal to provide information to the point that you're repeating myths.
> The intel processors have a separate math coprocessor that supports 80bit floats
x86 processors have two FPU units, the x87 unit (the one you're describing) and the SSE unit. Anyone compiling for x86-64 uses the SSE unit by default, and most x86-32 compilers still default to SSE anyway.
> Moving a float from a register in this coprocessor to memory truncates the float.
No it doesn't. The x87 unit has load and store instructions for 32-bit, 64-bit, and 80-bit floats. If you want to spill 80-bit values as 80-bit values, you can do so.
> Repeated math can be done inside this coprocessor to achieve higher precision so hot loops generally don't move floats outside of these registers.
Hot loops these days use the SSE stuff because they're so much faster than x87. Friends don't let friends use long double without good reason!
> Non-determinism occurs in programs running on intel with floats when threads are interrupted and the math coprocessor flushed.
Lol, nope. You'll spill the x87 register stack on thread context switch with FSAVE or FXSAVE or XSAVE, all of which will store the registers as 80-bit values without loss of precision.
That said, there was a problem with programs that use the x87 unit, but it has absolutely nothing to do with what you're describing. The x87 unit doesn't have arithmetic for 32-bit and 64-bit values, only 80-bit values. Many compilers, though, just pretended that the x87 unit supported arithmetic on 32-bit and 64-bit values, so that FADD would simultaneously be a 32-bit addition, a 64-bit addition, and an 80-bit addition. If the compiler needed to spill a floating-point register, it would spill the value as a 32-bit value (if float) or a 64-bit value (if double), and register spills are pretty unpredictable for user code. That's the nondeterminism you're referring to, and it's considered a bug in every compiler I'm aware of. (See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37... for a more thorough description of the problem).
It isn't zeal, it's a hazy, 15-year-old memory of getting different results on different executions on the same supercomputer. The story that went around was the one I relayed, but your link certainly does a better job explaining the things that can happen, in the user-perspective section:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...
Summarized as,
> Most users cannot be expected to know all of the ways that their floating-point code is not reproducible.
Glad to know that the situation is defaulting to SSE2 nowadays though.
> Non-determinism occurs in programs running on intel
FTFY. They even changed some of the more obscure handling between 8087, 80287, 80387. So much hoop jumping if you cared about binary reproducibility.
Seems to be largely fixed with targeting SSE even for scalar code now.
I like these. They are push back against the sort of… first correction that people make when encountering floating point weirdness. That is, the first mistake we make is to treat floats as reals, and then we observe some odd rounding behavior. The second mistake we make is to treat the rounding events as random. A nice thing about IEEE floats is that the rounding behavior is well defined.
Often it doesn't matter; you ask for a GEMM and you get whatever order of operations BLAS, AVX-whatever, and OpenMP conspire to give you, so it is more-or-less random.
But if it does matter, the ability to define it is there.
On the other hand, it feels wrong to me to call these "myths" when they are really just simplifications.
(And I have heard of the 80-bit internal register thing described in https://news.ycombinator.com/item?id=44888692 causing real problems for people before. And -ffast-math is basically spooky action at a distance considering how it bleeds into the entire program; see e.g. https://moyix.blogspot.com/2022/09/someones-been-messing-wit....)
> A nice thing about IEEE floats is that the rounding behavior is well defined.
Until it isn't. I used to play the CodeWeavers port of Kohan and while the game would allow you to do crossplay between Windows and Linux, differences in how the two OSes rounded floats would cause the game to desynchronize after 15-20 minutes of play or so. Some unit's pathfinding algorithm would zig on Windows and zag on Linux, causing the state to diverge and end up kicking off one of the players.
Is it possible that your different operating systems just had different mxcsr values?
Or, since it was a port, maybe they were compiled with different optimizations.
There are a lot of things happening under the hood but most of them should be deterministic.
until someone compiles with -ffast-math enabled, stating "I don't care about accuracy, as long as it's fast".
It is good to enable that flag because it also enables the “fun safe math optimizations” flag, and it is important to remind people that math is a safe way to have fun.
"Friends don't let friends use fast-math"
https://simonbyrne.github.io/notes/fastmath/
The differences are almost certainly not in how the two OSes rounded floats--the IEEE rounding modes are standard, and almost no one actually bothers to even change the rounding mode from the default.
For cross-OS issues, the most likely culprit is that Windows and Linux are using different libm implementations, which means that the results of functions like sin or atan2 are going to be slightly different.
The problem with rounding modes and other fpenv flags is that any library anywhere might flip some flag and suddenly the whole program changes behavior.
When writing code, it's useful to follow simpler rules than strictly needed to make it easier to read for other coders (including you in the future). For example, you might add some unnecessary brackets to expressions to avoid remembering all the operator precedence rules.
The article is right that sometimes you can use simple equality on floating point values. But if you have an (almost) blanket rule to use fuzzy comparison, or (even better) just avoid any sort of equality-like comparison altogether, then your code might be simpler to understand and you might be less likely to make a mistake about when you really can safely do that.
It's still sensible to understand that it's possible though. Just that you should try to avoid writing tricky code that depends on it.
Every time I use exact floating point equality I simply write a comment explaining why I did this. That's what comments are for.
That sort of argument can be problematic though, because the code can become misleading. If I see a fuzzy comparison then I'd assume it is there because it is needed, which might make the rest of the code more difficult to understand/modify because I then have to assume that everywhere else the values might be fuzzy.
Note: the improved loop in section "1. They are not exact" can easily hang. If count > 2^24, then the ulp of f is 2, and adding 1.0f leaves f unchanged. What's wild is that a few lines later he notes how above 2^24 the numbers start "jumping" every 2. OK, but FP values are always "jumping" by their ulp, regardless of how big or small they are.
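The hang condition can be seen without writing the loop itself; here single precision is emulated by round-tripping through `struct` (an illustration, not the article's code):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

f = f32(2.0**24)            # 16777216.0, ulp is now 2
print(f32(f + 1.0) == f)    # True: adding 1.0f leaves f unchanged, so the loop never advances
```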
I've never heard anyone say any of these supposed myths, except for the first one, sort of, but nobody means what the first one pretends it means, so this whole post feels like a big strawman to me.
Just to add some context Adam works at AMD as a dev tech, so he is constantly working with game studios developers directly. While I can't say I've heard the same things since I'm somewhere else, I have seen some of the assumptions made in some shader code and they do line up with the kind of things he's saying.
People do say a lot of nonsense about floats, my younger self trying to bash JavaScript included. They e.g. do fine as integers - up to 2**53; this could be optimised by a JIT to use actual integer math.
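For example, with plain Python doubles:

```python
print(2.0**53 - 1.0 == 2**53 - 1)   # True: every integer below 2^53 is representable exactly
print(2.0**53 + 1.0 == 2.0**53)     # True: 2^53 + 1 is the first integer that gets lost
```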
I was putting together some technical interview questions for our candidates and I wanted to see what ChatGPT would put for an answer. It told me floating point numbers were non-deterministic and wouldn’t back down from that answer so it’s getting it from somewhere.
I concur, it seems low effort, and the only real common "myth" (the 1st one) is not really disproven. In fact, the very example he gives goes to prove it, as it is going to become an infinite loop given a large enough N....
Also, compiler optimizations should not affect the result, especially without "fast-math/O3" (which arguably many people stupidly use nowadays, then complain).
> Also, compiler optimizations should not affect the result, especially without "fast-math/O3" (which arguably many people stupidly use nowadays, then complain).
-ffp-contract is an annoying exception to that principle.
The first rule is true, but relying on it is dangerous unless you are well versed in which values can and cannot be represented exactly as floats. It's best to pretend it isn't true in most cases, to avoid footguns.
Same, but I still learned a fair amount anyways. I say no harm, no foul.
People need to understand rounding better, especially the topic of when rounding can happen and when it can't for the basic operations.
Updating with a concrete example: The Fortran standard defines the real MOD and MODULO intrinsic functions as being equivalent to the straightforward sequence of a division, conversion to integer, multiplication, and subtraction. This formula can round, obviously. But MOD can be (and thus should be) implemented exactly by other means, and most Fortran compilers do so instead. This leaves Fortran implementors in a bit of a pickle -- conform to the standard, or produce good results?
Definitely. People think they'll get out of knowing how rounding works by using arbitrary precision arithmetic, but arguably it's even more important there (you run out of precision/memory at some point; what do you think happens then?). You can use floats for money if you do the rounding right.
A famous old interview question regarding floating points: https://stackoverflow.com/questions/6699066/in-which-order-s...
This awesome article drives home the point very succinctly: https://fabiensanglard.net/floating_point_visually_explained...
Related: What Every Computer Scientist Should Know About Floating-Point Arithmetic <https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...>
Also related:
https://floating-point-gui.de/
My list includes: https://0.30000000000000004.com/
My favorite trick: NaN boxing. NaNs aren't just for errors, but also for smuggling other data inside. For a double, you have a whopping 53 bits of payload, enough to cram in a pointer and maybe a type tag, and many JavaScript engines do (since JS numbers are doubles, after all)
https://wingolog.org/archives/2011/05/18/value-representatio...
https://piotrduperas.com/posts/nan-boxing
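A rough sketch of the trick in Python, poking at the bit pattern with `struct` (the field layout and the 51-bit payload limit are my own simplification, not taken from the linked engines; nothing but pack/unpack touches the value, so the payload is preserved):

```python
import struct

QUIET_NAN = 0x7FF8_0000_0000_0000      # exponent all ones + quiet bit set

def box(payload):
    """Hide an integer payload (< 2^51) inside a quiet NaN."""
    assert 0 <= payload < (1 << 51)
    return struct.unpack('<d', struct.pack('<Q', QUIET_NAN | payload))[0]

def unbox(x):
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return bits & ((1 << 51) - 1)

v = box(0xDEAD_BEEF)
print(v != v)            # True: it really is a NaN as far as arithmetic is concerned
print(hex(unbox(v)))     # 0xdeadbeef: the payload survives the round trip
```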
It is also how RISC-V floating point registers are required to store floats of smaller widths. Eg if your CPU supports 64-bit floats (D extension), its FPU registers will be 64-bit wide. If you use an instruction to load a 16-bit float (Zfh extension) into such a register, it will be boxed into a negative quiet NaN with all bits above the lower 16 bits set to 1.
52 bits of payload, and at least one bit must be set.
You can put stuff into the sign bit too, that makes 53. Yeah, the lower 52 bits can't all be zero - that'd be ±INF, but the other 2^53-2 values are all yours to use.
It's possible for the sign bit of a NaN to be changed by a "non-arithmetic" operation that doesn't trap on the NaN, so don't put anything precious in there.
This is a great article!
A few more resources on this topic:
https://randomascii.wordpress.com/category/floating-point/ - a series of articles about floating point, with deeper details.
https://float.exposed/ - explorer of floating point values by Bartosz Ciechanowski.
Fun facts about floating point, from my practice:
Obvious: An array containing NaNs will almost certainly be sorted incorrectly with std::sort.
Non-obvious: sorting an array containing NaNs with std::sort from libstdc++ (gcc's implementation) could lead to a buffer overflow.
Non-obvious: using MMX instructions without clearing the FPU state leads to incorrect results of floating-point instructions in other parts of the code base: https://github.com/simdjson/simdjson/issues/169
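The "obvious" item is easy to reproduce even outside C++; a quick Python illustration (the exact output depends on where the NaN sits, but it will generally not be in ascending order):

```python
data = [3.0, float('nan'), 1.0, 2.0]
print(sorted(data))   # e.g. [3.0, nan, 1.0, 2.0] -- every comparison involving nan is False,
                      # so the sort's ordering assumptions silently break
```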
> They are not exact
It's not exactly a myth; as the article mentions, they're only exact for certain ranges of values.
> NaN and INF are indication of an error
This is somewhat semantic, but dividing by zero typically does create a hardware exception. However, it's all handled behind the scenes, and you get "Inf" as the result.
You can make it so dividing by zero is explicitly an error; see the "-ftrapping-math" flag.
I think the myth about exactness is that you can't use strict equality with floating point numbers because they are somehow fuzzy. They are not.
Some operations involve rounding, notably conversion from decimal, but the (rounded) result is an exact number that can be stored and recovered without loss of precision and equality will work as expected. Converting from floating point binary to decimal is exact though, given enough digits (can be >1000 for very small or very large numbers).
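Concretely, in plain Python:

```python
x = 0.1                       # the decimal-to-binary conversion rounds exactly once
print(float(repr(x)) == x)    # True: the shortest round-tripping decimal recovers x exactly
print(format(x, '.55f'))      # 0.1000000000000000055511151231257827021181583404541015625
                              # the stored binary value, written out exactly in decimal
```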
Archive of the defunct tweet link: https://web.archive.org/web/20210608061416/https://twitter.c...
Here's one that's not a myth: IEEE-754 floats are the only "primitive types" that allow "a == a" to not be true.
I.e., two floats that are _identical_ to each other (even when it's _the same_ variable, at the same memory address) can compare as not _equal_ to each other, specifically if it's NaN. This is dictated by IEEE-754, and this is true for all programming languages I know, and to this day this makes zero sense to me, but apparently it is useful for some reason.
>but apparently this is useful for some reason.
It lets you easily test whether a value is NaN without needing a library function call. (Even if the language wanted to provide a NaN literal, there's more than one NaN value, so that wouldn't actually work!)
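That is exactly what a no-dependency NaN check looks like:

```python
def is_nan(x):
    return x != x            # only NaN compares unequal to itself

print(is_nan(float('nan')))  # True
print(is_nan(1.0))           # False
```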
>>> A beginner would use them, trusting they are infinitely capable and precise, which can lead to problems. An intermediate programmer knows that they have some limitations, and so by using some good practices the problems can be avoided.
Amusingly, pocket calculators became affordable while I was in high school, and anybody who was interested in math learned the foibles of floating point almost instantly. Now it's an advanced topic. A difference may have been that our calculators had very few significant digits, so the errors introduced by successive calculations became noticeable more quickly. Also, you had to think about what you were doing because, even if you trusted the calculator, you didn't trust your fingers.
Modern pocket calculators now use some very fancy math that makes them have much less error than if they used floating point. Many of them will use arbitrary-precision arithmetic at minimum, and there are some fancier schemes.
In the context of double precision the article says
> the largest integer value that can be represented exactly is 2^53
— I am confused as to why it is not 2^52, given that there are 52 bits of mantissa, so the relative accuracy is 2^-52, which translates to absolute accuracy larger than 1 after 2^52. Compare this to the table there saying "Next value after 1 = 1 + 2^-52".
There's an implied one bit, so you actually have a 53-bit significand (and 53-bit precision) given only a 52-bit mantissa.
Right, I did realize after posting that close to numbers of the form
1{hidden bit} + (1-2^-52){mantissa with all ones}
the relative accuracy — corresponding to the absolute accuracy of a single bit in the mantissa — is about 2^-53. The hidden bit is easy to forget about...
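The spacing can be checked directly with `math.ulp` (Python 3.9+), which makes the hidden-bit bookkeeping concrete:

```python
import math

print(math.ulp(1.0) == 2.0**-52)   # spacing just above 1.0, as in the article's table
print(math.ulp(2.0**52) == 1.0)    # integers between 2^52 and 2^53 are still 1 apart
print(math.ulp(2.0**53) == 2.0)    # above 2^53 the spacing doubles, so 2^53 is the last safe integer
```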
https://stackoverflow.com/questions/1848700/biggest-integer-...
The core thing to know about floating point numbers in comp langs is they aren't floating point numbers.
They are approximations of floating point numbers, and even if your current approximation of a floating point number to represent a value seems to be accurate as an integer value...
There is no guarantee if you take two of those floating point number approximations that appear completely accurate that the resulting operation between them will also be completely accurate.
That's not a useful way to think. Floating point numbers are just floating point numbers. You aren't approximating floating point numbers. You are approximating real numbers.
Floating point numbers are named by contrast with fixed point numbers, where the point (between the integer part and the fractional part) is fixed. That is why they are called floating. The nature of the real numbers is such that they can in general only be approximated, regardless of whether you use fixed point numbers, floating point numbers, or a fancier computable number representation.
The real numbers can be "represented" and "computed with" without needing approximations. The result is that there are no rounding errors. But it's not popular, and not necessarily worth the tradeoffs. On the other hand, see the Android calculator.
The Android calculator doesn't support all real numbers. It supports a specific subset of computable numbers that are computable using the operations presented in the calculator app. It's a good domain-specific number representation. It's far from real numbers.
We can't even represent all the natural numbers: that would require infinite memory. Now you want to represent the real numbers, equinumerous with the power set of natural numbers?
> We can't even represent all the natural numbers
We can represent all real numbers to the same extent that we can represent all natural numbers. They're just infinite strings, with each additional letter in the string mattering less and less as it goes along. The "Type Two Effectivity" model for modelling computations with infinite strings doesn't limit you to using just the computable real numbers. An uncomputable real number can be produced by outputting an infinite string letter by letter with each letter getting picked at random. The TTE model does OTOH limit you to computing only with the computable functions on those real numbers.
The TTE model basically uses (Python-like) generators to output an endless sequence of interval approximations to a real number.
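A toy version of that idea in Python, using a generator that narrows a rational interval around sqrt(2) by bisection (my own illustration of the TTE-style presentation, not a faithful implementation of it):

```python
from fractions import Fraction

def sqrt2_intervals():
    """Yield ever-tighter rational intervals [lo, hi] that contain sqrt(2)."""
    lo, hi = Fraction(1), Fraction(2)
    while True:
        yield lo, hi
        mid = (lo + hi) / 2
        if mid * mid < 2:
            lo = mid
        else:
            hi = mid

gen = sqrt2_intervals()
for _ in range(30):
    lo, hi = next(gen)
print(float(lo), float(hi), float(hi - lo))   # an interval of width 2^-29 around sqrt(2)
```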
Georg Cantor would like a word.
Representable numbers are countable; they have measure zero in the reals.
> An uncomputable real number can be produced by outputting an infinite string letter by letter with each letter getting picked at random.
This certainly outputs an uncomputable real number, given that you have true random input and not a PRNG.
But random numbers from a seed and an algorithm are by definition computable.
And if you can't compute the number, you have no basis for the claim that it's the right number.
Also, good luck computing with arbitrarily long, computed-on-demand digit strings. Consider: how do you know how many digits you need from the input in order to be sure of N correct digits in the output? This might be possible in general, but it certainly doesn't sound like my idea of fun.