This problem of what exactly a color value means is mostly inconsequential when you have 8 bits per component. Whether the denominator is 255 or 256, the errors are tiny: you must have really good color perception and get really close to the screen to see any difference at all, and your monitor/phone screen is probably not calibrated anyway, so who cares.
It becomes a pain in the ass when you're generating a VGA signal with a microcontroller with 8 color output pins (3 red, 3 green, 2 blue). The meaning of a color value is very real in this setup: it corresponds to a voltage level you must send to the VGA monitor, 0V-0.7V.
So the blue channel will map (0->0V, 1->0.23V, 2->0.47V, 3->0.7V), and the red/green will map (0->0V, 1->0.1V, ..., 7->0.7V). Notice how none of the blue voltages match any of the red/green ones (other than the extremes)? That means you don't get to see any pure grays -- the closest ones will have a bit of blue or yellow tint, depending on the direction of the difference.
Not only that, any gradients at all (other than the ones not mixing blue with the other channels) will be noticeably off: for example, the closest colors on the line from pure red to pure white will all be slightly orange or purple.
Code for VGA output in 8-bit color with double-buffered 320x240 framebuffer for the Raspberry Pi Pico 2 here, if anyone cares: https://github.com/moefh/pico-vga-8bit-demo
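A minimal sketch of that mismatch in Python, assuming ideal, evenly spaced output levels between 0 V and 0.7 V (a real resistor DAC will deviate). The helper name `levels` is made up for illustration and is not taken from any actual VGA code.

```python
from fractions import Fraction

V_MAX = Fraction(7, 10)  # 0.7 V full-scale level (assumed ideal)

def levels(bits):
    """Ideal output voltages for an n-bit channel, evenly spaced from 0 V to V_MAX."""
    steps = (1 << bits) - 1
    return [V_MAX * k / steps for k in range(steps + 1)]

blue = levels(2)       # 4 levels
red_green = levels(3)  # 8 levels

# Voltages that both kinds of channel can produce exactly
shared = sorted(set(blue) & set(red_green))
print([float(v) for v in shared])

# For each blue level, the closest red/green level and the leftover error
for b in blue:
    nearest = min(red_green, key=lambda v: abs(v - b))
    print(f"blue {float(b):.3f} V -> red/green {float(nearest):.3f} V "
          f"(error {float(nearest - b):+.3f} V)")
```

Under these assumptions only the two extremes are shared, and the two middle blue levels sit about 0.033 V away from the nearest red/green level, which is where the tint in the "grays" comes from.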
This was a genuinely thought provoking article. I had to challenge some personal assumptions.
Coming from an electrical engineering background, I disagree with how the author presented "Two types of quantizers". Mathematically rigorous, but not grounded in practical systems.
In ADCs, there is always an inherent +-1/2 LSB of quantisation uncertainty. The transfer characteristic is always mid-tread sampling, or at least I haven't come across any counter examples. This is true for bipolar or unipolar ADCs.
The lowest code is negative voltage reference, and the highest code the positive reference. The transfer characteristic plot will show what the author has demonstrated, that the highest and lowest bins will effectively be 1/2LSB in width.
In a unipolar system, this has the consequence of not being able to represent the midpoint voltage precisely, or in other words, the gray problem. In a bipolar system, 0V will be mid-tread N/2 value, but that doesn't mean it has "256 ranges".
So, I'll be sticking with (VREF+ - VREF-) * k / (2^N - 1). Or in other words I agree with the normalisation by 255. It's the fence-post error all over again: you have N values, but N-1 ranges. If you have fewer ranges than you do values, you need to distribute 1 of those ranges between two values, hence the 1/2LSB range endpoints.
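As a rough illustration of that formula, here is a small Python sketch. The function name `code_to_voltage` and the added VREF- offset (so that code 0 lands on the negative reference) are assumptions for the example, not part of the comment.

```python
def code_to_voltage(k, n_bits, vref_neg, vref_pos):
    """Map code k (0 .. 2^N - 1) to a voltage so both end codes land on the references."""
    max_code = (1 << n_bits) - 1  # 2^N - 1 ranges between 2^N codes
    if not 0 <= k <= max_code:
        raise ValueError("code out of range")
    return vref_neg + (vref_pos - vref_neg) * k / max_code

# Unipolar 8-bit example with references at 0 V and 1 V
print(code_to_voltage(0, 8, 0.0, 1.0))
print(code_to_voltage(255, 8, 0.0, 1.0))
print(code_to_voltage(127, 8, 0.0, 1.0), code_to_voltage(128, 8, 0.0, 1.0))
```

Codes 127 and 128 straddle the midpoint without hitting it, which is the "gray problem" mentioned above.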
glkindlmann
This is like a 1D version of what a scientific computing person might describe as the distinction between node-centered ("standard" or "mid-tread" in this post) vs cell-centered ("alternative", "mid-riser") samples: do you consider values to be at the middles of bins (or middles of triangles, or middles of tetrahedra), or at the boundaries between intervals (or vertices of triangles, or of tets)?
In a scientific computing setting it would be insane to start doing data processing without knowing how to interpret the values. In the context of audio signal processing, if you just get a stream of integers, you'd have to know the representational intent of those integers (mu-law encoding or linear?) if you're going to compute anything about the underlying signal. The meta-data accompanying the values would hopefully provide the answer.
But with 8-bit pixel values, absent any meta-data from a competent file format that can communicate the representational intent, we're adrift and there's no right answer (like the author says). Certainly no one can fault you for picking whichever one seems to give better results for your application, but you can raise awareness that bits without context have had their meaning undermined.
BearOso
There is a fallacy here in assuming there are 256 steps from 0 to 255. That's not true: there are 256 values that can be represented in 8 bits, and 255 steps (spaces between those values) from 0 (black) to 255 (pure white). Thus, the division by 255 isn't problematic. Of course, 128 isn't half grey; no value in 0-255 is, and quantized 8-bit values are almost always in sRGB, not a linear perceptual space.
This is the same kind of confusion that happens with sampling positions in modern APIs, where the location is specified in coordinates and not in pixel centers.
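To put a number on the "128 isn't half grey" point above, here is a short Python sketch using the standard sRGB decoding curve. The function name `srgb_to_linear` is just a label chosen for this example, and it assumes the 8-bit value really is sRGB-encoded.

```python
def srgb_to_linear(c):
    """sRGB-encoded value in [0, 1] -> linear-light value in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

linear_128 = srgb_to_linear(128 / 255)
print(round(linear_128, 3))  # about 0.216, far from 0.5
```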
jyounker
From an algebraic standpoint, the answer is clearly f(x) -> [0, 255].
If you don't have f(n * 0) == n * f(0), then all sorts of weird stuff happens, like:
For f(x) -> [0, 255] then f(0) + f(0) + f(0) = 0 + 0 + 0 = 0 = f(0)
For f(x) -> [0.5/8,7.5/8] then f(0) + f(0) + f(0) = 0.5/8 + 0.5/8 + 0.5/8 = 1.5/8 != f(0)
Choosing the latter means that if you do a calculation on the x side, then you can't expect it to match the calculation on the f(x) side.
herf
I'll argue for the +0.5 solution. First, I don't like half-sized intervals at the edges, and second, a 255-based representation is typically an SDR (not HDR) image.
RGB values represent luminances against some adapted state, and a "zero" in a daylit scene is not "zero luminance" - it's just about 0.001x as bright as the brightest point - it's millions of photons, way more than zero. In a sense our eyes experience contrast on a sliding scale, and there is no absolute zero in the system. For example, broadcast systems historically used 16-235 as their luminance range for SDR. I think any argument that says "we must have zero" is going to have a bias, but I don't think zero is needed for most things.
dudu24
If you have a ruler and it goes to 12 inches, you should normalize by the length L and not by 13, the number of points on the ruler.
Nuthen
That was a fun article to read of something I haven't had to think about in a while. It brought to mind moments in game development of having pixel art needing to be drawn on an integer value despite the game logic using floating point math. I tried something similar to the +0.5 in places so that it wouldn't look as bad (especially when there's a moving camera, which also needed to be truncated..).
I also enjoyed the 2002 article by Jonathan Blow [1] that's linked at the bottom. The visualization from the first article helped a lot once this started to go more in-depth.
Dammit, I have an 80%-written article covering the same issue but for ADCs, and had to put it aside for the past few months. There's historical precedent here from the 1960s and 1970s, and in large part it involves testing and definitions of gain and offset error in ADCs.
Someday I'll finish... :-(
virtualritz
255.0. Everything else makes no sense and is actually dangerous when working with colors. Trust me. :)
And when you go from float to 8bit you should dither to avoid banding.
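If in doubt, error diffusion with a random number between -0.5..=0.5 is fine. 0.5 here is dither_amplitude:
round(255 * input_value + dither_amplitude * random(-1, 1))
See e.g. my dithereens crate: https://crates.io/crates/dithereens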
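A hedged Python sketch of the random-dither rounding described in this comment. The function name `quantize_dithered`, the `rng` parameter and the final clamp are assumptions added for the example; this is not the API of any particular crate.

```python
import random

def quantize_dithered(value, dither_amplitude=0.5, rng=random):
    """Float in [0, 1] -> 8-bit int, adding uniform random noise before rounding."""
    noise = dither_amplitude * rng.uniform(-1.0, 1.0)
    q = round(255.0 * value + noise)
    return max(0, min(255, q))  # clamp, since noise can push the ends out of range

# A value halfway between two codes comes out as a mix of both codes
rng = random.Random(0)
samples = [quantize_dithered(0.5, rng=rng) for _ in range(10_000)]
print(sorted(set(samples)), sum(samples) / len(samples))
```

Averaged over many pixels the output tracks the input (here close to 127.5), which is what hides the banding.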
No, the "alternative" approach looks strange in the 7 bit example.
1.0 lies on the right side of the bin 7. But 0.0 lies on the left of bin 0.
The standard approach assumes that we have centered samples: that zero is dead black, plus (and minus!) some uncertainty and so is bin 7.
If the sampling of the intensity is distortion-free (no clipping took place due to overexposure) then bin 7 represents a range of possible values centered around 1.0.
It is not a half-sized interval.
> This means that when converting floating-point values in the [0,1] range back to integers, the extreme bins have effectively half the width of other bins.
Under any interpretation whatsoever of the image samples, there is latitude for interpreting the maximum value 255 as being distortion: clipping from an arbitrarily higher value. Shifting things by 0.5 doesn't fix this issue of not knowing whether 255 means that an intensity close to 1.0 is being represented (no distortion), or an outlier intensity of 37.49 (severely clamped). That could go the other way too.
In other words, there is a possible bias in the extreme bin. The signal could be limited such that the bin's full sampling range is not in effect, or the signal could be overwhelming, so that values far outside of the range are clipped and included.
The only way around this is to make the highest value a canary which represents "clipped value". That is to say, 255 means "clipped datum", so that only 254 and below are samples of unclipped signal. Machine-generated images (e.g. 3D rendering) then avoid the 255 value, and camera sensors are calibrated so that it doesn't occur when technical images are being shot.
dpark
The entire issue arises from the use of truncation, right? It guarantees that only an exact 1.0 could land in the 255 bin so the net effect is a reduction of 256 bins to 255 bins. (Using random numbers as shown also guarantees no 1.0.)
Why not scale to fill the available bins, though? i.e. trunc(result * 255.999)?
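A quick Python check of that idea, counting how many evenly spaced inputs reach the top bin under each scale factor. The helper `to_u8_trunc` and the sample count are made up for this sketch.

```python
import math

def to_u8_trunc(f, scale):
    """Truncating float -> 8-bit conversion with a configurable scale factor."""
    return min(255, math.trunc(f * scale))

# 100000 evenly spaced inputs in [0, 1), so an exact 1.0 never occurs
n = 100_000
inputs = [i / n for i in range(n)]
top_255 = sum(1 for f in inputs if to_u8_trunc(f, 255.0) == 255)
top_fill = sum(1 for f in inputs if to_u8_trunc(f, 255.999) == 255)
print(top_255, top_fill)
```

With a scale of 255.0 nothing below 1.0 reaches code 255, while 255.999 gives the top bin roughly its fair 1/256 share of the inputs.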
Sesse__
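You should multiply by 255.0, optionally add a dither (triangular is okay), and then let the FPU round using its default IEEE 754 round-to-nearest, ties-to-even mode. None of this crazy 0.5 stuff. :-)
See: cbloom's rant on quantization for deeper investigation: https://cbloomrants.blogspot.com/2020/09/topics-in-quantizat...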
"While in theory there are cases where you might want to use either type of quantization, if you are in games don't do that!
The reason is that the GPU standard for UNORM colors has chosen "centered" quantization, so you should do that too."
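The recipe in this comment (multiply by 255, optionally add triangular dither, round to nearest with ties to even) might look like the following in Python. The names `tpdf_noise` and `to_u8`, the one-code dither amplitude and the clamp are assumptions for this sketch. Python's built-in `round()` already rounds exact ties to the even integer.

```python
import random

def tpdf_noise(rng=random):
    """Triangular-PDF noise in (-1, 1): the difference of two uniform samples."""
    return rng.random() - rng.random()

def to_u8(value, dither=False, rng=random):
    """Float in [0, 1] -> 8-bit int using round-half-to-even."""
    x = value * 255.0
    if dither:
        x += tpdf_noise(rng)
    # round() sends exact ties to the nearest even integer
    return max(0, min(255, round(x)))

print(round(0.5), round(1.5), round(2.5))            # 0 2 2
print(all(to_u8(i / 255) == i for i in range(256)))  # True
```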
Retr0id
Both of these assume a linear transfer function, which is rarely the case.
jessetemp
The author is confusing bins with bin edges. In their first plot, the standard approach looks strange because 0-7 should be the bin edges, not the center points as shown in the plot.
You can see this confusion again in the histogram example. There are only 255 bins, not 256. If you fix that mistake and remove the 0.5 offset, then the histogram is distributed correctly at both ends.
theyeenzbeanz
Should always be 0-255 as that fits an unsigned byte.
crazygringo
Advice for anyone on mobile: read in landscape mode if you want to be able to see the division by 256 version code example at the start.
The HTML/CSS is bad: it lets the code completely overflow the right edge of the page instead of wrapping.
I re-read this post three times in total confusion before I figured out the most important piece was off-screen entirely.
2001zhaozhao
There are only two real solutions after factoring in the need to preserve black as zero.
They are "rgb / 255.0" vs. "rgb / 256.0". Both have different tradeoffs. Pick your poison. (If you're using an 8-bit display signal then you'd better match whatever value the OS picked for the mapping back to the display, so your RGB values pass through unchanged.)
MyMemoryfails
As a game dev, I never understood why floats are used to represent colors. Aren't integers better? The issues this article mentions wouldn't exist.
I can only think it's due to integers having undefined behavior on overflow; usually it's wrapping, but not always.
orlp
When going from float to u8 you should add a triangular dither. It makes a world of difference for grayscale gradients, even in 24bit truecolor.
nasitsony
No, I do not think so
wyager
You don't need to make this judgement; it's fixed by the colorspace you're working in.
First, figure out what colorspace the processing needs to happen in. Usually this is linear RGB.
Then, figure out what OETF and EOTF your input/output format use. This will be something like PQ or HLG. This will exactly specify the meaning of each integer value.
This fixes the choice of representation and conversion.
atilimcetin
Interesting article. I tend to use
- i = min(floor(f * 256), 255) (from float to uint8)
- f = i / 255 (from uint8 to float)
Basically a mix of the 2 approaches mentioned in the article.
For all integers between [0,255], if I do uint8 -> float -> uint8 conversion, I will get the same result.
--
edit: I wondered what's the maximum jitter amount that I can introduce to the float and get the same uint8 value. And also these 0->0.0 and 255->1.0 should map properly.
With my approach at the top, the jitter margin that I can introduce is 1/65280.
But with the article's approach
- i = floor(f * 255 + 0.5)
- f = i / 255
maximum jitter margin is 1/510 (which is better).
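Both margins can be checked with exact rational arithmetic. The Python sketch below is one way to do it; the function names and the explicit lists of bin edges are choices made for this example, not something taken from the article.

```python
from fractions import Fraction
from math import floor

def mixed_to_u8(f):
    """Scheme from the comment: scale by 256, floor, clamp to 255."""
    return min(floor(f * 256), 255)

def rounded_to_u8(f):
    """Second scheme quoted above: scale by 255, add 0.5, floor."""
    return floor(f * 255 + Fraction(1, 2))

def from_u8(i):
    """Shared inverse: divide by 255, kept exact as a Fraction."""
    return Fraction(i, 255)

def jitter_margin(edges):
    """Smallest distance from any i/255 to a point where the output code changes."""
    return min(abs(from_u8(i) - e) for i in range(256) for e in edges)

# Inputs at which each float -> uint8 scheme switches to the next code
mixed_edges = [Fraction(k, 256) for k in range(1, 256)]
rounded_edges = [Fraction(2 * k - 1, 510) for k in range(1, 256)]

# Both schemes round-trip every 8-bit value
assert all(mixed_to_u8(from_u8(i)) == i for i in range(256))
assert all(rounded_to_u8(from_u8(i)) == i for i in range(256))

print(jitter_margin(mixed_edges))    # 1/65280
print(jitter_margin(rounded_edges))  # 1/510
```

The mixed scheme is tight because i/255 sits only i/65280 above the lower edge i/256 of its bin, so code 1 (and, from the other side, code 254) has almost no slack.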
AlienRobot
Case against 255: it looks wrong in the graph :(
Case against 256: no 0 or 1 values :(
Considering how important having a 0 and 1 value is for arithmetic in general, I think 255 is better.
dist-epoch
A similar issue exists in the audio world, for example 16-bit integer audio is between [-32768, 32767] (non-symmetric), but floating point audio is [-1.0, 1.0].
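One common way to handle that asymmetry, sketched in Python: divide by 32768 on the way in, and clamp on the way out. This is only one convention (some code scales by 32767 instead), and the function names are invented for the example.

```python
def int16_to_float(s):
    """Divide by 32768: -32768 maps to -1.0, and +32767 falls just short of +1.0."""
    return s / 32768.0

def float_to_int16(x):
    """Scale by 32768, round, and clamp so that +1.0 saturates at 32767."""
    return max(-32768, min(32767, round(x * 32768.0)))

print(int16_to_float(-32768), int16_to_float(32767))
print(float_to_int16(-1.0), float_to_int16(1.0))
```

With this choice every 16-bit sample survives an int -> float -> int round trip, at the cost of +1.0 not being exactly representable on the integer side.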
RobRivera
Are we talking 0 or 1 based values? HONKHONK*
JamesTRexx
Why not (uint8_t) ( float * (256/255) ) * 255?
Zardoz84
0-255
I find using 1-256 weird
ctdinjeu8
Both. 255 for each color and the last 1 as the alpha for each channel.
Why not??? Fight me
DigitallyFidget
255 gives 0-255, which gives you a zero value. 256 is 1-256, you lose the option of setting 0.
dgently7
"Let’s say you’re writing an image processing program. The program takes in an image, converts it to floating point, does some processing and finally saves the modified pixels to disk as 8-bit colors. "
Excuse to argue about the best way aside, if this is the goal you should not be rolling your own image file reading. You should use OpenImageIO. Idk what approach it takes in its internal conversion to float, but that library is more likely to have the right answer than you trying to roll it yourself, given it's the library used internally by tons of professional image manipulation software...
[1] https://web.archive.org/web/20240706043551/https://number-no...