Anatomy of a Short Message


October 2, 2017
Aivis Olsteins


It is a well-known fact that a Short Message (SMS) can contain somewhere between 140 and 160 characters. Some devices or phones insist on 160, some on fewer. Twitter (whose origins are closely tied to short messages), for example, originally had a limit of 140 characters. Speakers of languages not based on the Latin script will say the number is even lower. Why this ambiguity, and why exactly these numbers? And are these numbers correct at all?

First, let's start with the history. SMS dates back to the time when the primary function of a phone was voice calls. There were no keyboards or touch screens, and the only way to enter text was the numeric keypad: several letters were assigned to each key, and the required letter was selected by pressing the same key repeatedly. Entering text this way is slow, and the number of letters available is relatively small. SMS was initially a part of GSM networks only, which were deployed in European countries, so it was deemed sufficient to support a limited set of Latin letters, digits, and some special characters. Much like ASCII, which covers most needs of Latin-based writing in a 7-bit encoding space, the encoding scheme for SMS also used 7 bits per character and covered mostly the same characters as ASCII, with some exceptions. For example, most of the control characters in the ASCII range 0x00 ... 0x1F were replaced by characters used in European languages that fall outside ASCII (Greek, Nordic, Spanish, etc.).

Without going into technical details, the resulting protocol allows 160 characters of text to be transmitted in 7-bit encoding. Thus, the total space allocated for user data is 160 x 7 = 1120 bits, or 140 octets. That limit still stands today, and all further developments and variations build on it.
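The 160 x 7 = 1120 arithmetic can be checked with a minimal sketch of septet packing. It assumes the GSM convention of filling each octet starting from the least significant bit, and it relies on lowercase ASCII letters having the same codes in the GSM 7-bit alphabet. The function name `pack_septets` is made up for this illustration.

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, least significant bits first."""
    acc = 0
    for i, septet in enumerate(septets):
        acc |= (septet & 0x7F) << (7 * i)   # place each septet 7 bits further up
    n_octets = (7 * len(septets) + 7) // 8  # round up to whole octets
    return acc.to_bytes(n_octets, "little")


# Lowercase ASCII letters share their codes with the GSM 7-bit alphabet.
packed = pack_septets(b"hello")
print(packed.hex())                    # e8329bfd06: 5 characters in 5 octets
print(len(pack_septets(b"a" * 160)))   # 140 octets = 1120 bits
```

Eight characters pack into seven octets, which is how 160 characters fit into the 140 octets of user data.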

As SMS grew in popularity, two problems became clear: 1) many languages could not use it because their characters were not supported, and 2) the 160-character limit was too small.

1. Language support

The solution to the first problem seems easy at first glance, but it comes at a cost: let's use the same 1120-bit space and represent each character with 8 bits. That allows a much wider range of Latin-alphabet characters to be represented, including those found in Nordic, Spanish, and other languages. The available size is reduced accordingly, to exactly 1120 / 8 = 140 characters. That is still far from covering all languages: Russian, Chinese, Arabic, etc. are still not covered. Using the same method and encoding the text with UCS-2 (a 16-bit encoding), it becomes possible to cover most of the widely used languages of the world. This is the most widely used encoding for non-ASCII and/or non-Latin messages. The cost: the message size is reduced further, to only 1120 / 16 = 70 characters.
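The capacity arithmetic can be expressed as a short sketch. The constant and function names are illustrative, and Python's `utf-16-be` codec stands in for UCS-2 here, which is equivalent only for characters in the Basic Multilingual Plane.

```python
USER_DATA_BITS = 1120  # 160 characters x 7 bits, as derived in the text


def single_message_capacity(bits_per_char):
    """Characters that fit in one message at a given character width."""
    return USER_DATA_BITS // bits_per_char


def fits_in_one_ucs2_message(text):
    """True if the text, encoded as 16-bit units, fits in one message."""
    encoded = text.encode("utf-16-be")  # assumption: stands in for UCS-2
    return len(encoded) * 8 <= USER_DATA_BITS


for bits in (7, 8, 16):
    print(bits, single_message_capacity(bits))  # 7 160, 8 140, 16 70

print(fits_in_one_ucs2_message("я" * 70))  # True
print(fits_in_one_ucs2_message("я" * 71))  # False
```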

2. Longer messages

Now it becomes evident that there must be a way to send longer messages. Even 160 characters is not much, and for some languages 70 is clearly insufficient. What if we seamlessly split longer messages behind the scenes, transmit them as separate parts, and concatenate them on the receiving side? This approach does not require modifying the transmission infrastructure, which changes less frequently than user handsets and costs much more to upgrade or replace. The implementation seems obvious, except that there is no field or indicator in the message itself to signal that a given message is part of a multipart message and should be reassembled when received. The solution was to "eat" a small part from the beginning of the message itself and use it as a special header describing what kind of message it is. It is called the User Data Header (UDH), and apart from telling the receiving side that this is part X of a multipart message, it has other functions, which we will not touch on here. This approach reduces the length of each part of a concatenated message by at least 48 bits, so the resulting message lengths per part are as follows:

For 7-bit encoded message: 160 for complete message, 153 for a part of multipart message ( 7 x 7 = 49 bits less )

For 8-bit encoded message: 140 for complete message, 134 for a part of multipart message ( 6 x 8 = 48 bits less )

For 16-bit encoded message: 70 for complete message, 67 for a part of multipart message ( 3 x 16 = 48 bits less )
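The per-part figures can be reproduced from the header size. The sketch below assumes the common concatenation header with an 8-bit reference number, which occupies 6 octets (48 bits). For 7-bit text the header is rounded up to a whole number of septets, which is where the 49th bit comes from. Function names are illustrative.

```python
import math

USER_DATA_BITS = 1120  # total user data space per message


def concat_udh(reference, total_parts, part_number):
    """Build a 6-octet concatenation header (8-bit reference variant)."""
    return bytes([
        0x05,         # length of the header data that follows
        0x00,         # element id: concatenated message, 8-bit reference
        0x03,         # length of this element's data
        reference,    # same value in every part of one message
        total_parts,  # how many parts the message has
        part_number,  # position of this part, starting at 1
    ])


def chars_per_part(bits_per_char, udh_octets=6):
    """Characters left in one part after the header is taken out."""
    # The header is padded so the text starts on a character boundary.
    header_chars = math.ceil(udh_octets * 8 / bits_per_char)
    return USER_DATA_BITS // bits_per_char - header_chars


udh = concat_udh(reference=0x2A, total_parts=2, part_number=1)
print(udh.hex())  # 0500032a0201
for bits in (7, 8, 16):
    print(bits, chars_per_part(bits))  # 7 153, 8 134, 16 67
```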

 


 

Technically, this solves the problem of sending messages of arbitrary length in almost any language of the world. The infrastructure does not change and can remain transparent to the content. In most cases it does: most mobile operators charge per message part, regardless of how many characters are actually sent.
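Since charging is typically per part, it can be useful to estimate how many parts a text will occupy. This is a minimal sketch using the single-message and per-part limits from the text. It assumes the caller already knows the character width and that every character occupies exactly one unit (which ignores, for instance, extended characters that take two septets). The names are illustrative.

```python
import math

# (single-message limit, per-part limit) keyed by bits per character
LIMITS = {7: (160, 153), 8: (140, 134), 16: (70, 67)}


def count_parts(n_chars, bits_per_char):
    """Number of message parts needed for n_chars characters."""
    single, per_part = LIMITS[bits_per_char]
    if n_chars <= single:
        return 1  # fits without a header
    return math.ceil(n_chars / per_part)


print(count_parts(160, 7))   # 1
print(count_parts(161, 7))   # 2
print(count_parts(307, 7))   # 3
print(count_parts(71, 16))   # 2
```

Note how a single extra character past the limit doubles the number of billed parts.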

Below are some examples of how a few letters are encoded:

| Letter | Description | UTF-16 | UTF-8 | GSM 03.38 (7-bit) |
|---|---|---|---|---|
| ñ | Spanish small n with tilde | U+00F1 | 0xC3 0xB1 (c3b1) | 0x7D |
| á | Small a acute | U+00E1 | 0xC3 0xA1 (c3a1) | Not present, available via shift table + 0x61 |
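The values for these two letters can be reproduced with a short sketch. The `GSM_BASIC` string below is the basic GSM 03.38 character table typed out by hand for illustration (the escape position 0x1B is represented by the ESC character). It is not taken from any library, and national shift tables are not modelled.

```python
# Basic GSM 03.38 table: the index of a character is its 7-bit code.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"    # 0x00-0x0F
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"   # 0x10-0x1F
    " !\"#¤%&'()*+,-./"      # 0x20-0x2F
    "0123456789:;<=>?"       # 0x30-0x3F
    "¡ABCDEFGHIJKLMNO"       # 0x40-0x4F
    "PQRSTUVWXYZÄÖÑÜ§"       # 0x50-0x5F
    "¿abcdefghijklmno"       # 0x60-0x6F
    "pqrstuvwxyzäöñüà"       # 0x70-0x7F
)


def describe(ch):
    """Return code point, UTF-8 bytes, and GSM 7-bit code (or None)."""
    gsm = GSM_BASIC.find(ch)
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(),
        f"0x{gsm:02X}" if gsm >= 0 else None,  # None: not in the basic table
    )


print(describe("ñ"))  # ('U+00F1', 'c3b1', '0x7D')
print(describe("á"))  # ('U+00E1', 'c3a1', None)
```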

 

Reference:

GSM 03.38 page on Wikipedia

 

 

 
