Tuesday, January 5, 2010

Base64 Encoding

When text files are attached to SMTP emails, they can be attached in their plain text format. Binary files, however, cannot be attached without some form of encoding. The encoding used for this in SMTP mail and in many other Internet protocols is called Base64 encoding, or simply MIME Base64.
In VC++, binary data is stored in BYTE (unsigned char) values. Each BYTE is effectively a Base256 digit (0 … 255), which is not easily readable or printable. Base64 encoding uses a character set that is common to most environments and easy to read and print.
Base64 uses 64 characters (values 0 … 63), where each character represents 6 bits. The character set is as follows:
‘A’…‘Z’ for 0-25, ‘a’…‘z’ for 26-51, ‘0’…‘9’ for 52-61, ‘+’ for 62 and ‘/’ for 63, making 64 in total. In addition, the ‘=’ character is used for padding.
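The character set above can be written down as a simple lookup table. The following is a minimal sketch, not a reference implementation; the names kBase64Alphabet, Base64CharFromIndex and Base64IndexFromChar are made up for this illustration.

```cpp
#include <cassert>
#include <string>

// The 64-character Base64 alphabet. A character's position in this
// string is the 6-bit value it represents (hypothetical name).
static const std::string kBase64Alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"   // values 0-25
    "abcdefghijklmnopqrstuvwxyz"   // values 26-51
    "0123456789"                   // values 52-61
    "+/";                          // values 62 and 63

// Map a 6-bit value (0..63) to its Base64 character.
char Base64CharFromIndex(unsigned index)
{
    assert(index < 64);
    return kBase64Alphabet[index];
}

// Map a Base64 character back to its 6-bit value.
// Returns -1 for any character outside the alphabet, including '='.
int Base64IndexFromChar(char c)
{
    std::string::size_type pos = kBase64Alphabet.find(c);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

For instance, Base64CharFromIndex(20) gives 'U' and Base64IndexFromChar('/') gives 63.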
BYTE (unsigned char) Array vs. Base64 Encoded String
“I have a buffer as a BYTE array. I want to convert it into a human readable/printable form. I need a data representation scheme, and that is Base64 encoding. You can store this data in text, XML, or application configuration files.”
One char in C++ is the minimum storage unit, which is 8 bits on the platforms discussed here. A Base64 encoded ASCII character represents 6 bits of Base2 (binary) data. The least common multiple of 8 and 6 is 24, which means:
3 Bytes = 4 Base64 encoded characters = 24 bits in Base2
For example, the encoded value of SoS is U29T. Encoded in ASCII, S, o, S are stored as the bytes 83, 111, 83, which are 01010011, 01101111 and 01010011 in base 2. These three bytes are joined together in a 24-bit (24 = 8x3) buffer, producing 010100110110111101010011. The buffer is then split into four groups of 6 bits (24 = 6x4), and the value of each group is used as an index into the Base64 character set.
Text:                     S         o         S
ASCII:                    83        111       83
Base2 Pattern (24 bits):  01010011  01101111  01010011
6 Bit Pattern:            010100    110110    111101    010011
Index:                    20        54        61        19
Base64 Encoded ASCII:     U         2         9         T
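The SoS example can be reproduced in a few lines of C++. This is only a sketch of the bit packing for one complete three-byte group; the function name EncodeTriple is an assumption made for this illustration, and BYTE is declared locally so the code does not depend on the Windows headers.

```cpp
#include <cassert>
#include <string>

typedef unsigned char BYTE;  // local stand-in for the Windows BYTE type

// Encode exactly three bytes into four Base64 characters
// (hypothetical helper; padding is not handled here).
std::string EncodeTriple(const BYTE* group)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    assert(group != 0);

    // Join the three bytes into one 24-bit buffer.
    unsigned long buffer = (static_cast<unsigned long>(group[0]) << 16) |
                           (static_cast<unsigned long>(group[1]) << 8) |
                            static_cast<unsigned long>(group[2]);

    // Read the buffer back as four 6-bit indexes, most significant first.
    std::string out;
    for (int shift = 18; shift >= 0; shift -= 6)
        out += alphabet[(buffer >> shift) & 0x3F];
    return out;
}
```

Calling EncodeTriple on the bytes 83, 111, 83 ("SoS") yields the indexes 20, 54, 61, 19 and therefore the string "U29T".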

The example above shows that Base64 encoding converts 3 unencoded BYTEs (here, simple ASCII characters) into 4 encoded ASCII characters, so a Base64 encoded string is about 4/3 (≈1.33) times larger than the original data. The algorithm takes every three bytes of data and converts them into four printable ASCII characters. If the size of the incoming byte array is not an exact multiple of three, the algorithm appends equal signs (one for each missing byte) to the end of the Base64-encoded string. There can therefore be 0, 1 or 2 ‘=’ signs at the end, depending on how many bytes are missing from the last three-byte group of the original data. This convention guarantees that the length of a Base64-encoded string is always a multiple of four. The length of a Base64-encoded string can be calculated as:
Base64 = ((Bytes + 2) / 3) x 4

where Base64 and Bytes are the number of bytes in the Base64-encoded string and in the original byte array respectively, and the division is an integer division (the remainder is discarded). Adding 2 before dividing rounds up to whole three-byte groups without adding an extra group when Bytes is already a multiple of three. You can use this formula to calculate the size of the column holding Base64-encoded text.
For example, if a byte array contains the 13 characters of the ASCII string "Hello, world!", the size of the corresponding Base64-encoded string can be calculated as:

Base64 = ((13 + 2) / 3) x 4 = 5 x 4 = 20 (bytes)

The resulting value will be "SGVsbG8sIHdvcmxkIQ==". The string ends with two equal signs ("=="), indicating that two bytes are missing from the last three-byte block of the byte array.
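The size calculation can be sketched as a small helper that rounds the byte count up to whole three-byte groups and multiplies by four. The name Base64EncodedLength is an assumption for this illustration, and the result does not include any line breaks added later.

```cpp
#include <cassert>
#include <cstddef>

// Number of characters in the padded Base64 encoding of `bytes`
// input bytes (hypothetical helper; line breaks are not counted).
std::size_t Base64EncodedLength(std::size_t bytes)
{
    // Guard against overflow of the arithmetic below for huge inputs.
    assert(bytes <= (static_cast<std::size_t>(-1) / 4) * 3 - 2);

    // Integer division rounds down, so adding 2 first rounds the
    // number of three-byte groups up. Each group becomes 4 characters.
    return ((bytes + 2) / 3) * 4;
}
```

With this sketch, 13 bytes give 20 characters, and inputs of 1, 2 or 3 bytes all give 4 characters.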
 
Padding
The first case is where you have one byte remaining. Here you should pad two additional bytes of all zeros onto the end of the binary sequence. You can then represent the one byte with two Base64 characters followed by two padding characters.
Let us consider an example.
‘00000001’
Pad the single-byte instance with two more bytes of zeros.
‘00000001’ ‘00000000’ ‘00000000’
Now break up the binary sequence into sets of six bits.
‘000000’ ‘010000’ ‘000000’ ‘000000’
Take the first two base-64 characters and pad two ‘=’ characters to the end of the sequence.
‘AQ==’
The second case is where you have two bytes remaining.
‘00000010’ ‘00000001’
Here you should pad one additional zero byte to the end of the binary sequence.
‘00000010’ ‘00000001’ ‘00000000’
Now break up the binary sequence into sets of six bits.
‘000000’ ‘100000’ ‘000100’ ‘000000’
We then take three base-64 characters and pad with one ‘=’ sign.
‘AgE=’
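Both padding cases can be handled in one loop. The following is a minimal sketch of an encoder under the rules described above, not a tuned or definitive implementation; the function name Base64Encode and its signature are assumptions made for this illustration, and no line breaks are inserted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef unsigned char BYTE;  // local stand-in for the Windows BYTE type

// Encode `length` bytes as padded Base64 (hypothetical helper).
std::string Base64Encode(const BYTE* data, std::size_t length)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    assert(data != 0 || length == 0);

    std::string out;
    out.reserve(((length + 2) / 3) * 4);

    for (std::size_t i = 0; i < length; i += 3)
    {
        std::size_t remaining = length - i;  // 1, 2, or at least 3

        // Build the 24-bit buffer; missing bytes stay zero.
        unsigned long buffer = static_cast<unsigned long>(data[i]) << 16;
        if (remaining > 1)
            buffer |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (remaining > 2)
            buffer |= static_cast<unsigned long>(data[i + 2]);

        // The first two characters are always emitted; the last two
        // are replaced by '=' when their source bytes are missing.
        out += alphabet[(buffer >> 18) & 0x3F];
        out += alphabet[(buffer >> 12) & 0x3F];
        out += (remaining > 1) ? alphabet[(buffer >> 6) & 0x3F] : '=';
        out += (remaining > 2) ? alphabet[buffer & 0x3F] : '=';
    }
    return out;
}
```

For the two examples above, the single byte 00000001 encodes to "AQ==" and the two bytes 00000010 00000001 encode to "AgE=".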
Line Length
The MIME Base64 specification (RFC 2045) requires that each line of the encoded stream contain at most 76 Base64 characters, which also keeps the stream readable. After each 76 characters, we should insert a carriage return and line feed (\r\n) into the stream. This increases the stream length by approximately 3% (2 extra characters for every 76).
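The line-length rule can be applied as a separate pass over an already encoded string. This is a sketch under stated assumptions: the name WrapBase64Lines is hypothetical, and the sketch places \r\n between lines only, so no break is added after the final line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Split an encoded string into lines of at most `lineLength`
// characters, separated by "\r\n" (hypothetical helper).
std::string WrapBase64Lines(const std::string& encoded,
                            std::size_t lineLength = 76)
{
    assert(lineLength > 0);

    std::string out;
    for (std::size_t i = 0; i < encoded.size(); i += lineLength)
    {
        if (i != 0)
            out += "\r\n";  // separator between lines, none after the last

        // append() copies at most lineLength characters and stops
        // at the end of the string for the final, shorter line.
        out.append(encoded, i, lineLength);
    }
    return out;
}
```

A 152-character encoded string becomes two 76-character lines with one \r\n between them, 154 characters in total.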
 
