What's the Max Valid Length (in bytes) of an Emoji?

Introduction

Recently I stumbled upon a Twitter conversation

If I want to store a single emoji in a Postgres field, what is the LIMIT I should set on the TEXT field?

which talked about storing a single emoji in a Postgres TEXT field, with a LIMIT.

The answer to this question is already stated in the Twitter conversation: 35 bytes. While this answer is correct in this context, it is a bit short. In fact, we also need to specify the character set, which is the Unicode 15 standard in this case, and how the data is stored, which is UTF-8.

While I do not want to discuss the performance implications of changing the database field in this blog post, I find the question of how many bytes a single emoji can have very interesting, as there are several aspects to it. So this blog post will be about character encoding, the Unicode standard, and emojis, such that at the end we can hopefully state and understand the answer to the question “What’s the Max Valid Length (in bytes) of an Emoji?”.

Character encoding

To first clear up some definitions and provide some background, we will start with character encoding in general. A character is any input that is required to express text, which includes non-visible characters, e.g. "ZERO WIDTH NO-BREAK SPACE", or characters that are not rendered as a single glyph, e.g. \n.

Let us start with the smallest unit which can be used to represent data, the bit. A bit can have two values, which are usually labeled $0$ and $1$. These two states are realized on hardware which can have two distinct states that, depending on the use case, can be easily changed, read, or both. For example, an HDD stores data in the magnetization of small regions on its platters, flash memory uses special transistors called metal–oxide–semiconductor field-effect transistors (MOSFETs), and so on.

As this two-state representation is good for hardware, but not so convenient to work with from a software perspective, another data unit has been introduced, the byte. A byte, in the historical sense, was the number of bits needed to encode a single character. Nowadays, when we speak of a byte without specifying the number of bits, we usually mean a byte consisting of $2^3 = 8$ bits, sometimes also referred to as an “octet”. But as we will see, one byte will not be sufficient to encode all needed characters.

The number of bits needed to encode a character depends on how exactly the character is translated into bits, which is specified by the used character encoding. Of course, to translate the bit series back into a character, one needs to know the encoding; otherwise, the bit series may be interpreted as a different character.
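
To make this concrete, here is a small sketch in Python (Python is just a convenient choice for illustration here and not part of the original question), showing that the same byte series decodes to different characters depending on the assumed encoding:

```python
# The same two bytes, decoded under two different encodings.
data = bytes([0xC3, 0xA9])

print(data.decode("utf-8"))    # 'é'  -> one character, encoded in two bytes
print(data.decode("latin-1"))  # 'Ã©' -> two characters, one byte each
```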

Unicode encoding

For the rest of this article, we will focus on Unicode, which is a specific rule set for how characters are mapped to bit series. The Unicode standard is the commonly used text encoding for the Linux world and the web, and it is what we need to understand further to answer the initial question.

The Unicode encoding model consists of two separate parts. One part defines how characters map to so-called code points and the other part defines how these code points map to bits.

Code points

A Unicode code point, denoted by U+<hex number>, can be translated into a character using the Unicode character code mapping, where <hex number> is a hexadecimal number from $0000_{\mathrm{hex}}$ to $10\mathrm{FFFF}_{\mathrm{hex}}$. In general, one would omit the leading zeros, e.g. $0000 = 0$, but in the Unicode standard, the digits have different meanings depending on their position.

In Unicode, the hex numbers have the following format: $[0-10]_{\mathrm{hex}} [0000 - \mathrm{FFFF}]_{\mathrm{hex}}$. The leading group, with $17$ possible values $[0 - 10]_\mathrm{hex} = [0 - 16]_\mathrm{dec}$, specifies the so-called Unicode plane. The second group specifies the code point inside the specified plane, which allows specifying $2^{16} = 65\,536$ unique values, $[0000 - \mathrm{FFFF}]_\mathrm{hex} = [0 - 65\,535]_\mathrm{dec}$. Note that code points with small values are padded with leading zeros, such that a code point is always written with at least four hex digits (e.g. U+0041, not U+41), while code points from the higher planes use five or six digits. So in total, there exist $17*2^{16} = 1\,114\,112$ available code points in the Unicode standard, of which the Unicode 15 standard currently assigns $149\,186$. For example, U+0041 represents the character A and U+0061 stands for a.
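
As a small Python sketch, we can look up code points directly; the plane is simply the value of the leading hex group, i.e. the code point divided by $2^{16}$:

```python
# Code points in Python: ord() returns the code point of a character,
# chr() goes the other way.
print(hex(ord("A")))         # 0x41 -> U+0041
print(chr(0x61))             # 'a'  -> U+0061

# The plane is the leading hex group, i.e. code_point // 0x10000.
print(ord("A") // 0x10000)   # 0 (Basic Multilingual Plane)
print(ord("😀") // 0x10000)  # 1 (Supplementary Multilingual Plane)
```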

Using U+0000-10FFFF or U+0000..U+10FFFF, we denote a range of Unicode code points; in this example, it is the full Unicode range. When we want to write a sequence, we write U+0041 U+0042 U+0043 (ABC), where the spaces between the Unicode code points are just for readability.

The Unicode standard furthermore defines so-called Unicode blocks, contiguous groups of code points without a fixed size, which contain characters that are commonly used together. For a list, refer to Wikipedia or the official specification.

Storing code points as bits

The Unicode code points are, so to speak, an intermediate representation of characters as hexadecimal values. But as initially stated, these values need to be further translated into ones and zeros, such that the two-state machine (the computer) can process this data.

To solve this, the Unicode standard defines different mapping methods, of which we will only discuss the Unicode Transformation Format (UTF) encodings in this blog post. More specifically, the Unicode standard defines UTF-{8,16,32}, where for each of these the number specifies how many bits one code unit has. A code unit is the smallest number of bits that can be allocated together. Note that for UTF-8, the term code unit corresponds to the standard 8-bit byte.

To encode the full Unicode range, each of these encodings needs at most 32 bits per code point, which means that, for example, for UTF-8 a maximum of $32 / 8 = 4$ bytes (four code units) is required. For UTF-16, at most two code units (4 bytes) are needed, and UTF-32 always uses exactly one code unit (4 bytes) to encode any code point of the Unicode alphabet.
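
A quick Python check of these sizes for a code point outside the first plane (the “-be” encodings are used so that Python does not prepend a byte order mark):

```python
s = "\U0001F600"  # 😀, a code point outside the Basic Multilingual Plane

print(len(s.encode("utf-8")))      # 4 bytes -> four 8-bit code units
print(len(s.encode("utf-16-be")))  # 4 bytes -> two 16-bit code units (a surrogate pair)
print(len(s.encode("utf-32-be")))  # 4 bytes -> one 32-bit code unit
```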

Additionally, UTF-8 is binary compatible with the American Standard Code for Information Interchange (ASCII) encoding, which was common before Unicode and requires at most 7 bits to encode the full ASCII range, including all basic Latin characters. Note that this is not true for UTF-{16,32}.
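
Again as a small Python sketch: ASCII text is byte-for-byte identical under UTF-8, but not under UTF-16 or UTF-32:

```python
print("ABC".encode("ascii"))      # b'ABC'
print("ABC".encode("utf-8"))      # b'ABC' -> identical to ASCII
print("ABC".encode("utf-16-be"))  # b'\x00A\x00B\x00C' -> not ASCII compatible
```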

Although UTF-32 would be the easiest format to implement, since each code unit corresponds to exactly one Unicode code point, the advantage of, for example, UTF-8 is that for code points that require fewer than four bytes, only that smaller number of bytes needs to be allocated. Hence UTF-{8,16} are also referred to as “variable-length character encodings”.
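
The variable length is easy to see in Python by encoding characters that live in different parts of the Unicode range:

```python
# UTF-8 spends between one and four bytes per code point.
for ch in ["A", "ü", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A 1
# ü 2
# € 3
# 😀 4
```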

On the web, the most used encoding is UTF-8, which potentially has the largest savings in required bytes. In particular, this is because all HTML tags are made up only of ASCII characters, which can be encoded using a single byte, and because most network protocols are designed to work with packages of 8 bits.

Nowadays, UTF-32 is mostly used for the internal representation of text, to simplify working on this data. And UTF-16 is used, for example, in the internal Windows API for text (up to Windows 10), since although it requires more bits, it has some performance benefits. But these performance benefits would be negligible in comparison to the additional transfer time required when the data has to be sent over the network.

As data stored in databases mostly has to be transferred over the network, storing it in a format other than UTF-8 would mean re-encoding it, and UTF-8 potentially offers the largest savings in required space. For example, Postgres only supports UTF-8 as its Unicode encoding, in contrast to, for example, MySQL. As the initial question was about Postgres, we will focus on UTF-8 for the rest of this blog post.

Emoji encoding in Unicode

To answer the initial question, we also need to discuss how emojis are handled in Unicode. The Unicode standard has a technical report dedicated to how emojis are encoded. I will summarize the most important features of how emojis are built in Unicode, but there are of course also other blog posts which discuss this topic in more detail.

A little side note: as the Unicode technical standard states, the term emoji comes from Japanese, where e stands for “picture” and moji for “written character”. For more details on the history of emojis, refer to Wikipedia.

The simplest way of encoding an emoji is with a single code point of the Unicode alphabet, as is the case, for example, for the Emoticons Unicode block. But there are several mechanisms, which are also used elsewhere in the Unicode standard, to compose an emoji from multiple Unicode characters. Such a combined character is called a “grapheme cluster” and is also used for encoding other characters. For example, the “Combining Diacritical Marks” are used in combination with normal characters to create characters like “ü”.
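
As a small Python sketch of such a grapheme cluster, “ü” can either be a single code point or a combination of “u” with the combining diaeresis U+0308; both are canonically equivalent:

```python
import unicodedata

single = "\u00FC"    # ü as a single code point
cluster = "u\u0308"  # u + COMBINING DIAERESIS, rendered as one glyph

print(single == cluster)                                # False, different code points
print(unicodedata.normalize("NFC", cluster) == single)  # True, canonically equivalent
```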

There are several ways to compose emojis in Unicode. Three of them will be discussed in the following subsections; for a full list, refer to the Unicode technical report.

Emoji presentation sequence

Before emojis were defined in the Unicode standard, there already existed symbols, for example in the Dingbats Unicode block. To make use of those code points, and to also provide a fallback if a renderer does not know the emoji form, one way to compose an emoji is to use “emoji variation sequences”. For example, the (text) heart U+2764 (❤) can also be displayed as an emoji U+2764 U+FE0F (❤️) using the “emoji presentation selector” U+FE0F. It also works the other way around using the “text presentation selector” U+FE0E, e.g. U+1F3AD (🎭) and the text version U+1F3AD U+FE0E (🎭︎).
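
A minimal Python sketch of these variation sequences (how they actually render of course depends on the used font and terminal):

```python
HEART = "\u2764"               # ❤, defined long before emojis
EMOJI_PRESENTATION = "\uFE0F"  # emoji presentation selector
TEXT_PRESENTATION = "\uFE0E"   # text presentation selector

print(HEART + EMOJI_PRESENTATION)        # forced emoji style (❤️)
print("\U0001F3AD" + TEXT_PRESENTATION)  # forced text style (🎭︎)
```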

For a list of all defined emoji variation sequences refer to this list.

Emoji modifier

Additionally, one can use “emoji modifiers” to compose emojis. For Unicode 15, the only defined modifiers are the skin tone modifiers, expressed by the Unicode code points U+1F3FB-1F3FF, which “modify” the preceding emoji. Of course, not all “base emojis” support a skin tone modification (e.g. a plane has no skin color).

An example would be the waving hand U+1F44B (👋) combined with the “medium-dark skin tone” skin color modifier U+1F3FE (​🏾) resulting in a “medium-dark waving hand” U+1F44B U+1F3FE (👋🏾). If one does not want those two code points to be joined into a single glyph, one can separate them by a “zero-width space” U+200B, i.e. U+1F44B U+200B U+1F3FE (👋​🏾).
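
In Python, such a modified emoji is simply the concatenation of its code points (again, whether it renders as one or two glyphs depends on the font):

```python
WAVING_HAND = "\U0001F44B"
MEDIUM_DARK_SKIN = "\U0001F3FE"
ZERO_WIDTH_SPACE = "\u200B"

print(WAVING_HAND + MEDIUM_DARK_SKIN)                     # rendered as a single glyph
print(WAVING_HAND + ZERO_WIDTH_SPACE + MEDIUM_DARK_SKIN)  # rendered as two separate glyphs
print(len(WAVING_HAND + MEDIUM_DARK_SKIN))                # 2 code points
```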

Emoji sequences

The final method which we will discuss is “emoji sequences”. They are very similar to the emoji modifiers, but in this case, they explicitly need to be joined by another Unicode code point. In general, one emoji sequence can contain multiple joiners.

For example, the black cat is not created using a modifier, but rather U+1F408 U+200D U+2B1B (🐈‍⬛). This sequence is called an “emoji ZWJ sequence” since it contains the “zero width joiner” (ZWJ) U+200D character. Without this character, the two remaining Unicode code points U+1F408 U+2B1B (🐈⬛) are displayed as separate glyphs.

Of course, there can also be sequences built from emoji presentation sequences, for example, the “eye in a speech bubble” U+1F441 U+FE0F U+200D U+1F5E8 U+FE0F (👁️‍🗨️), built from the “eye” U+1F441 U+FE0F (👁️) and the “speech bubble” U+1F5E8 U+FE0F (🗨️).
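
A small Python sketch of such ZWJ sequences, using the black cat example from above:

```python
ZWJ = "\u200D"  # zero width joiner
CAT, BLACK_SQUARE = "\U0001F408", "\u2B1B"

black_cat = CAT + ZWJ + BLACK_SQUARE
print(black_cat)           # one glyph (🐈‍⬛), if the font supports the sequence
print(CAT + BLACK_SQUARE)  # two glyphs (🐈⬛), no joiner
print(len(black_cat))      # 3 code points
```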

The full list of sequences containing the ZWJ character can be found in this list.

Combining it all

Now we should have everything together to answer the initial question. First, as already found in the Twitter thread, the longest sequence occurs when two people are kissing and both persons have skin tone modifiers; e.g., the first such sequence in the sequence list for Unicode 15 is “kiss: man, man, light skin tone” U+1F468 U+1F3FB U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468 U+1F3FB (👨🏻‍❤️‍💋‍👨🏻), which is a total of ten Unicode code points.

Let us quickly go through this emoji. The first two and the last two Unicode code points are “man with light skin tone” U+1F468 U+1F3FB (👨🏻). These two emojis are joined, using the ZWJ U+200D, to another combination of two emojis, again joined by the ZWJ, namely a heart U+2764 U+FE0F (❤️) and a kiss U+1F48B (💋). Note that U+2764 U+FE0F U+200D U+1F48B (❤️‍💋) alone is not combined into a single character.
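
We can let Python list the ten code points for us (the emoji is written with explicit escapes here so that the source stays readable):

```python
import unicodedata

kiss = ("\U0001F468\U0001F3FB\u200D\u2764\uFE0F"
        "\u200D\U0001F48B\u200D\U0001F468\U0001F3FB")  # 👨🏻‍❤️‍💋‍👨🏻

for ch in kiss:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
print(len(kiss))  # 10 code points
```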

As we know, each of those code points can take one to four bytes in UTF-8, so the easiest way is to just count the bytes of the UTF-8-encoded string. Although we could derive the byte count from the value of each code point, this would require looking at the UTF-8 encoding in more technical detail.

That being said, the byte sequence of the emoji 👨🏻‍❤️‍💋‍👨🏻, generated using xxd, and annotated with the corresponding Unicode code points, is the following:

F0 9F 91 A8     = U+1F468 (👨)
F0 9F 8F BB     = U+1F3FB (🏻)
E2 80 8D        = U+200D (ZWJ)
E2 9D A4        = U+2764 (❤)
EF B8 8F        = U+FE0F (emoji presentation selector)
E2 80 8D        = U+200D (ZWJ)
F0 9F 92 8B     = U+1F48B (💋)
E2 80 8D        = U+200D (ZWJ)
F0 9F 91 A8     = U+1F468 (👨)
F0 9F 8F BB     = U+1F3FB (🏻)

Indeed, when we count the bytes (a block of two hex digits is one byte), we find that in total we need 35 bytes to encode this emoji. But we have to keep in mind that this is only the maximum number of needed bytes for the Unicode 15 standard. With every new standard, there could be new sequences that need more bytes to be encoded.
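
As a cross-check in Python, encoding the sequence to UTF-8 indeed yields 35 bytes and the same byte series as the xxd output above:

```python
kiss = ("\U0001F468\U0001F3FB\u200D\u2764\uFE0F"
        "\u200D\U0001F48B\u200D\U0001F468\U0001F3FB")  # 👨🏻‍❤️‍💋‍👨🏻

encoded = kiss.encode("utf-8")
print(len(kiss))                 # 10 code points
print(len(encoded))              # 35 bytes
print(encoded.hex(" ").upper())  # F0 9F 91 A8 F0 9F 8F BB E2 80 8D ...
```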

Conclusion

While the answer to the initial question (35 bytes) was already known before, I took the chance to learn more about the technical details of what the Unicode standard defines.

In general, if I needed to define such a limit on a database text field, I would probably leave some room for the future, e.g. $8*3 + 10*4 = 64$ bytes, such that I do not need to adjust the limit for future Unicode standards (which I would probably forget to do). But before defining such a LIMIT, one should evaluate the performance gains and space reduction of such a change.
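
For completeness, here is a minimal sketch of such a check in Python, assuming a self-chosen limit of 64 bytes (this is just an illustration, not a recommendation for a specific database setup):

```python
LIMIT_BYTES = 8 * 3 + 10 * 4  # 64 bytes, some headroom beyond the current 35 bytes

def fits_limit(value: str, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the UTF-8 encoding of value fits within limit bytes."""
    return len(value.encode("utf-8")) <= limit

print(fits_limit("👨🏻‍❤️‍💋‍👨🏻"))  # True (35 <= 64)
```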