Ascii85: Difference between revisions

TedBot (talk | contribs)
m Bot: renamed section heading (History → 역사)
Removed long-neglected untranslated text
 
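One drawback of Ascii85 is that the encoded data may contain [[:en:escape character|escape characters]] such as backslashes and quotation marks, which have special meanings in many programming languages and in some text-based protocols and can therefore cause problems. Z85 was designed to be safe for use in source code.<ref>[http://rfc.zeromq.org/spec:32 "Z85 - ZeroMQ Base-85 Encoding Algorithm"]</ref>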
 
== History ==
 
===btoa version===
The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and [[hexadecimal]]) and three 32-bit [[checksum]]s. The decoder needs to use the file length to see how much of the group was padding. The initial proposal for btoa encoding used an encoding alphabet starting at the ASCII space character through "t" inclusive, but this was replaced with an encoding alphabet of "!" to "u" to avoid "problems with some mailers (stripping off trailing blanks)."<ref>{{웹 인용|last1=Orost|first1=Joe|title=Re: COMPRESSING of binary data into mailable ASCII Re: Encoding of binary data into mailable ASCII|url=https://groups.google.com/forum/#!original/comp.compression/Ve7k8XF-F5k/gBWfpyL-gfgJ|website=Google Groups|accessdate=11 April 2015}}</ref> This program also introduced the special "<code>z</code>" short form for an all-zero group. Version 4.2 added a "<code>y</code>" exception for a group of all ASCII [[space (punctuation)|space]] characters (0x20202020).
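The two short forms can be tried with Python's standard <code>base64</code> module, which implements an Ascii85 encoder with an optional space-folding flag. This is only an illustration of the "<code>z</code>" and "<code>y</code>" exceptions; it does not reproduce the btoa framing lines, file length, or checksums.

```python
import base64

# A full group of four zero bytes collapses to the single character "z".
zero_group = base64.a85encode(b"\x00\x00\x00\x00")

# With foldspaces=True, four ASCII spaces (0x20202020) collapse to "y".
space_group = base64.a85encode(b"    ", foldspaces=True)

# Without the exception, the same four spaces need a full five-character group.
space_plain = base64.a85encode(b"    ")

print(zero_group, space_group, space_plain)
```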
 
===ZMODEM version===
"ZMODEM Pack-7 encoding" encodes groups of 4 octets into groups of 5 printable ASCII characters, similar to Ascii85 (or perhaps exactly the same?). When [[ZMODEM]] programs send pre-compressed 8-bit data files over [[8-bit clean|7-bit data channels]], it uses "ZMODEM Pack-7 encoding".<ref> Chuck Forsberg. [http://www.omen.com/zmdmwn.html "Recent Developments in ZMODEM"]. "ZMODEM Pack-7 packs 4 bytes into 5 printing characters."</ref>
 
===Adobe version===
Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. The characters used are the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value), and white space is ignored. Adobe uses the delimiter "<code>~></code>" to mark the end of an Ascii85-encoded string, and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to three null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.
 
The reverse is applied when decoding: The last block is padded to 5 bytes with the Ascii85 character "<code>u</code>", and as many bytes as were added as padding are omitted from the end of the output (see example).
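The encoding and decoding rules above can be sketched in Python as follows. The function names <code>ascii85_encode</code> and <code>ascii85_decode</code> are hypothetical helpers written for this illustration, and the decoder assumes well-formed input (it performs no validation).

```python
import struct

def ascii85_encode(data: bytes) -> str:
    """Encode bytes in the Adobe style: <~ ... ~> with a truncated final group."""
    out = []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        pad = 4 - len(chunk)                      # 0 to 3 null bytes of padding
        (value,) = struct.unpack(">I", chunk + b"\x00" * pad)
        if value == 0 and pad == 0:
            out.append("z")                       # short form for a full all-zero group
            continue
        digits = []
        for _ in range(5):                        # five base-85 digits, least significant first
            value, d = divmod(value, 85)
            digits.append(chr(d + 33))
        group = "".join(reversed(digits))
        out.append(group[:5 - pad])               # drop one character per padding byte
    return "<~" + "".join(out) + "~>"

def ascii85_decode(text: str) -> bytes:
    """Decode Adobe-style Ascii85; assumes valid input."""
    body = "".join(text.split())                  # whitespace is ignored everywhere
    if body.startswith("<~"):
        body = body[2:]
    if body.endswith("~>"):
        body = body[:-2]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == "z":                        # all-zero group
            out += b"\x00" * 4
            i += 1
            continue
        group = body[i:i + 5]
        i += 5
        pad = 5 - len(group)
        group += "u" * pad                        # pad the final group with "u"
        value = 0
        for ch in group:
            value = value * 85 + (ord(ch) - 33)
        out += struct.pack(">I", value)[:4 - pad] # drop one byte per padding character
    return bytes(out)

print(ascii85_encode(b"Man "))
```

A round trip such as <code>ascii85_decode(ascii85_encode(data))</code> should return the original bytes for any input length.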
 
NOTE: The padding is not arbitrary. Converting from binary to Base64 only regroups bits and does not change them or their order (a high bit in binary does not affect the low bits in the Base64 representation). In converting a binary number to base 85 (85 is ''not'' a power of two), high bits do affect the low-order base-85 digits, and conversely. Padding the binary low (with zero bits) while encoding and padding the base-85 value high (with 'u's) while decoding ensures that the high-order bits are preserved: the zero padding in the binary leaves enough room that the small amount added by the 'u' padding stays within the padding bytes and never carries into the high bits.
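The effect can be checked with a small Python sketch. It encodes the three bytes 0x80 0x00 0x00, transmits only four of the five base-85 digits, and then compares two decoder padding choices. The helper names <code>digits85</code> and <code>from_digits</code> are made up for this illustration.

```python
def digits85(value):
    """Return the five base-85 digits of a 32-bit value, most significant first."""
    out = []
    for _ in range(5):
        value, d = divmod(value, 85)
        out.append(d)
    return out[::-1]

def from_digits(digits):
    """Rebuild the integer from base-85 digits, most significant first."""
    value = 0
    for d in digits:
        value = value * 85 + d
    return value

original = bytes([0x80, 0x00, 0x00])                  # three source bytes
padded = original + b"\x00"                           # pad the binary low with a zero byte
sent = digits85(int.from_bytes(padded, "big"))[:4]    # only four digits are transmitted

# Decoder pads with the lowest digit (0, the character "!"): the value shrinks
# and the borrow reaches the bytes we wanted to keep.
low = from_digits(sent + [0]).to_bytes(4, "big")[:3]

# Decoder pads with the highest digit (84, the character "u"): the value grows,
# but only inside the zero padding byte.
high = from_digits(sent + [84]).to_bytes(4, "big")[:3]

print(low == original, high == original)
```

With digit 0 as padding the decoded bytes come out as 0x7F 0xFF 0xFF, while padding with digit 84 recovers the original three bytes.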
 
<!-- TODO: Wikify and summarize the following paragraphs. This is a nice explanation, but the style does not fit Wikipedia, and it is way too long compared to the rest of the article. Ideally we could link to an explanation like this outside of Wikipedia. -->
<!--
In sending bits/bytes requiring K bytes for complete/lossless encoding and only having L bytes one can pad with (K-L) padding bytes P send the full set of K bytes and an indicator of how many should be dropped. The recipient can decode and get a result (including the padding bytes) and use the indicator select and drop them. Very easy and it is how I would have done base64 and base85 encoding.
 
Consider the following:
 
Pad the L bytes with padding K-L padding bytes P. Encode that. Don't send them all, but send a shorter version (since you don't need the lower part to calculate the upper part). The recipient can't decode (since he needs the full allocation of material to decode) so he pads with bytes Q (to have enough to decode to something) and decodes this to a string of K bytes. He takes the first L bytes from this string as the decoded version. Is this the same as the original string of L bytes.
 
<pre>
base64:
 
Encodes three bytes into four base64 digits.
 
Assume one encodes two bytes. Ooops! Too short.
Suppose one pads this with an arbitrary byte.
 
In the bit pattern:
 
x x x x x x x x|x x x x x x x x|p p p p p p p p
BYTE_1 BYTE_2 BYTE_3
 
we have the padding bits, p, can be arbitrary.
 
Convert to base64. This is very easy as 64 is a power of
two as is 256 (bytes are just base256 numbers).
 
x x x x x x|x x x x x x|x x x x p p|p p p p p p
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 <-- base 64 "digits"
 
where the digits are base64 digits (which can be displayed
when taken from a selected set of 64 symbols).
 
Send only the first THREE base64 digits of the encoded material:
 
x x x x x x|x x x x x x|x x x x p p
DIGIT_1 DIGIT_2 DIGIT_3
 
The recipient pads with any (arbitrary) base64 digit
 
x x x x x x|x x x x x x|x x x x p p|q q q q q q
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4
 
and decodes to binary
 
x x x x x x x x|x x x x x x x x|p p q q q q q q
BYTE_1 BYTE_2 BYTE_3
 
and keeps the first two bytes only. Are they the same as
the original? Yes, no matter what the p's and q's are since
they are just added and dropped.
 
base85:
 
Encodes four bytes into five base85 digits.
 
Assume one encodes three bytes. Ooops! Too short.
Suppose one pads this with an arbitrary byte
In the bit pattern:
 
x x x x x x x x|x x x x x x x x|x x x x x x x x|p p p p p p p p
BYTE_1 BYTE_2 BYTE_3 BYTE_4
 
we have the padding bits, p, which can be arbitrary.
 
Convert to base85. This is not so easy as 85 is NOT a power
of two as is 256 (bytes are just base256 numbers).
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 DIGIT_5
 
NOT SO EASY NOW! The high bits in the original data can
affect the low base85 digits and the low base85 digits
can affect the HIGH BITS!
 
Only send the first four base85 digits.
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4
 
The recipient pads this with a base85 digit (Q) and
decodes this to binary (or base256 if you like to think
of bytes instead of bits):
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 Q
 
b b b b b b b b|b b b b b b b b|b b b b b b b b|b b b b b b b b
 
and keeps the first three bytes. Are they the same as the
original three bytes? NOT NECESSARILY.
 
In base64, as converting from base256 (bytes) to base 64 (bits)
just regroups bits, the values of the bits do not change.
 
In base85 a change in the lowest base85 digit CAN affect the
high bits if the padding is wrongly done.
 
EXAMPLE:
 
Encoding ASC(128) ASC(0) ASC(0) (three bytes - the two last being
nulls) and padding the binary with zero bits:
 
1 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0
BYTE_1 BYTE_2 BYTE_3
 
PADDED:
 
1 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0
 
WRITTEN IN BASE85: (encoded)
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 DIGIT_5
 
I am not going to specify these but simply note DIGIT_5 CANNOT
BE ZERO! (or else the number would be divisible by 85 but it is
equal to 2^31)
 
SEND THE FIRST FOUR base85 DIGITS:
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4
 
The recipient pads ... let me use ZERO (base85 digit) to pad:
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 0
 
and converts back to binary.
 
b b b b b b b b|b b b b b b b b|b b b b b b b b|b b b b b b b b
 
This is base85 encoding (NOT base64) the low order base85 digit
of the number 2^31 is NOT ZERO (DIGIT_5 is not zero) and the recipient
replaced it with zero so he has a SMALLER number (smaller than 2^31) and
so when he converts to binary, the high order bit is ZERO - NOT ONE!
 
It is not true that one can use arbitrary paddings in encoding/decoding
this time.
 
TRICK:
 
Always pad the binary LOW (use zeros) and pad the base85 number HIGH
(use 84 as the base85 digit for padding).
 
What happens in that case?
 
In that case the recipient has padded with 84.
 
DIGIT_1 DIGIT_2 DIGIT_3 DIGIT_4 84
 
and decodes:
 
b b b b b b b b|b b b b b b b b|b b b b b b b b|b b b b b b b b
 
This number is not the original and, in fact, is LARGER THAN
THE ORIGINAL BINARY NUMBER. How much larger? At most
84+the_original
(you changed the last base85 digit to its maximum).
 
This number is then
 
1 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0
+ X (0 <= X <= 84)
-----------------------------------------------------------------
1 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|t t t t t t t t
^^^^^^^^^^^^^^^
X ONLY AFFECTS THESE
(and not even the first one!)
</pre>
 
where X is at most 84. NOTE THAT WE PADDED THE BINARY WITH ZEROS! THERE WILL BE NO CARRY WHEN WE ADD X (smaller than 256) BEYOND THE LOWEST BYTE DUE TO THE NUMBER OF LOW ORDER ZERO PADDING BITS IN THE ORIGINAL NUMBER TO WHICH X IS ADDED.
 
As long as we pad the binary with zeros and the base85 with 84s, since 85^n < 256^n we will decode to a larger number (with some low order bits) BUT WE HAVE PADDED WITH ENOUGH ZEROS SO ADDING THE EXTRA TO THE ORIGINAL BINARY ONLY AFFECTS THE PADDED BITS (there is NO CARRY INTO THE BITS WE WANT TO KEEP).
 
You can see similarly that padding the binary high (with 1 bits) and the base85 character high (84s) will cause problems.
 
So ... the padding for decoding has to be matched with the padding for encoding in base85 since high bits affect low base85 digits and conversely - which does not occur with base64 encoding.
-->
 
In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.
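As an illustration, the decoder in Python's standard <code>base64</code> module skips whitespace by default, so a group split across lines still decodes. The input below is the start of the example further down with arbitrary whitespace inserted.

```python
import base64

# "9jqo^" and "BlbD-" with a line break, spaces and a tab inserted mid-group.
wrapped = b"<~9jq\n  o^B\tlbD-~>"

# adobe=True tells the decoder to expect the <~ ... ~> delimiters.
decoded = base64.a85decode(wrapped, adobe=True)
print(decoded)
```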
 
Adobe's specification does not support the "<code>y</code>" exception.
 
===ZeroMQ Version (Z85)===
Z85, the [[ZeroMQ]] base-85 encoding algorithm, is a string-safe variant of base85. By avoiding the double-quote, single-quote, and backslash characters, Z85-encoded data can be better embedded in [[command-line interpreter]] strings. Z85 uses the characters <tt>0</tt>...<tt>9</tt>, <tt>a</tt>...<tt>z</tt>, <tt>A</tt>...<tt>Z</tt>, <tt>.</tt>, <tt>-</tt>, <tt>:</tt>, <tt>+</tt>, <tt>=</tt>, <tt>^</tt>, <tt>!</tt>, <tt>/</tt>, <tt>*</tt>, <tt>?</tt>, <tt>&#38;</tt>, <tt>&lt;</tt>, <tt>&gt;</tt>, <tt>(</tt>, <tt>)</tt>, <tt>&#91;</tt>, <tt>&#93;</tt>, <tt>&#123;</tt>, <tt>&#125;</tt>, <tt>@</tt>, <tt>%</tt>, <tt>$</tt>, <tt>#</tt>.<ref>Pieter Hintjens [http://rfc.zeromq.org/spec:32 RFC 32/Z85 - ZeroMQ Base-85 Encoding Algorithm]</ref>
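A minimal Python sketch of Z85, using the character order listed above. The restriction to input lengths that are a multiple of 4 bytes (and encoded lengths that are a multiple of 5 characters) reflects the cited specification as understood here; the function names are hypothetical.

```python
Z85_ALPHABET = (
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".-:+=^!/*?&<>()[]{}@%$#"
)

def z85_encode(data: bytes) -> str:
    """Encode bytes whose length is a multiple of 4 (no padding scheme here)."""
    if len(data) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    out = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")
        group = []
        for _ in range(5):                        # least significant digit first
            value, d = divmod(value, 85)
            group.append(Z85_ALPHABET[d])
        out.extend(reversed(group))
    return "".join(out)

def z85_decode(text: str) -> bytes:
    """Decode a string whose length is a multiple of 5 characters."""
    if len(text) % 5:
        raise ValueError("input length must be a multiple of 5 characters")
    out = bytearray()
    for i in range(0, len(text), 5):
        value = 0
        for ch in text[i:i + 5]:
            value = value * 85 + Z85_ALPHABET.index(ch)
        out += value.to_bytes(4, "big")
    return bytes(out)

print(z85_encode(bytes.fromhex("864fd26fb559f75b")))
```

Because the alphabet contains no quotation marks or backslashes, the output can be pasted into a string literal without escaping.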
 
===Example for Ascii85===
<!-- The (historic) slogan of [[Wikipedia]]: -->
<!-- Following is just as it appears on the Wikipedia "Base64" page -->
A quote from [[Thomas Hobbes|Thomas Hobbes's]] ''[[Leviathan (book)|Leviathan]]'':
 
: ''Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.''
 
If this is initially encoded using US-ASCII, it can be reencoded in Ascii85 as follows:
 
<pre>
<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>
</pre>
 
{| class="wikitable"
| Text content
| colspan="8" align="center"| '''M'''
| colspan="8" align="center"| '''a'''
| colspan="8" align="center"| '''n'''
| colspan="8" align="center"| ''' '''
| align="center"| ...
| colspan="8" align="center"| '''s'''
| colspan="8" align="center"| '''u'''
| colspan="8" align="center"| '''r'''
| colspan="8" align="center"| '''e'''
|-
| ASCII
| colspan="8" align="center"| 77
| colspan="8" align="center"| 97
| colspan="8" align="center"| 110
| colspan="8" align="center"| 32
| align="center"| ...
| colspan="8" align="center"| 115
| colspan="8" align="center"| 117
| colspan="8" align="center"| 114
| colspan="8" align="center"| 101
|-
| Bit pattern ||0||1||0||0||1||1||0||1||0||1||1||0||0||0||0||1||0||1||1||0||1||1||1||0||0||0||1||0||0||0||0||0
| align="center"| ...
||0||1||1||1||0||0||1||1||0||1||1||1||0||1||0||1||0||1||1||1||0||0||1||0||0||1||1||0||0||1||0||1
|-
| 32-bit Value
| colspan="32" align="center"| 1,298,230,816 = 24×85<sup>4</sup> + 73×85<sup>3</sup> + 80×85<sup>2</sup> + 78×85 + 61
| align="center"| ...
| colspan="32" align="center"| 1,937,076,837 = 37×85<sup>4</sup> + 9×85<sup>3</sup> + 17×85<sup>2</sup> + 44×85 + 22
|-
| Base 85 (+33)
| colspan="6" align="center"| 24 (57)
| colspan="7" align="center"| 73 (106)
| colspan="6" align="center"| 80 (113)
| colspan="7" align="center"| 78 (111)
| colspan="6" align="center"| 61 (94)
| align="center"| ...
| colspan="6" align="center"| 37 (70)
| colspan="7" align="center"| 9 (42)
| colspan="6" align="center"| 17 (50)
| colspan="7" align="center"| 44 (77)
| colspan="6" align="center"| 22 (55)
|-
| ASCII
| colspan="6" align="center"| 9
| colspan="7" align="center"| j
| colspan="6" align="center"| q
| colspan="7" align="center"| o
| colspan="6" align="center"| ^
| align="center"| ...
| colspan="6" align="center"| F
| colspan="7" align="center"| *
| colspan="6" align="center"| 2
| colspan="7" align="center"| M
| colspan="6" align="center"| 7
|}
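The arithmetic for the first group in the table can be reproduced in a few lines of Python. This is only a check of the worked example; the variable names are arbitrary.

```python
group = b"Man "
value = int.from_bytes(group, "big")   # the four bytes read as one big-endian 32-bit number

digits = []
remaining = value
for _ in range(5):                     # repeated division by 85, least significant digit first
    remaining, d = divmod(remaining, 85)
    digits.append(d)
digits.reverse()

encoded = "".join(chr(d + 33) for d in digits)   # add 33 to reach the range "!" to "u"
print(value, digits, encoded)
```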
 
If the final block contains fewer than 4 bytes, it is padded with zero bytes.
{| class="wikitable"
| Text content
| colspan="8" align="center"| '''.'''
| colspan="8" align="center"| ''\0''
| colspan="8" align="center"| ''\0''
| colspan="8" align="center"| ''\0''
|-
| ASCII
| colspan="8" align="center"| 46
| colspan="8" align="center"| 0
| colspan="8" align="center"| 0
| colspan="8" align="center"| 0
|-
| Bit pattern
||0||0||1||0||1||1||1||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0||0
|-
| 32-bit Value
| colspan="32" align="center"| 771,751,936 = 14×85<sup>4</sup> + 66×85<sup>3</sup> + 56×85<sup>2</sup> + 74×85 + 46
|-
| Base 85 (+33)
| colspan="6" align="center"| 14 (47)
| colspan="7" align="center"| 66 (99)
| colspan="6" align="center"| 56 (89)
| colspan="7" align="center"| 74 (107)
| colspan="6" align="center"| 46 (79)
|-
| ASCII
| colspan="6" align="center"| /
| colspan="7" align="center"| c
| colspan="6" align="center"| ''Y''
| colspan="7" align="center"| ''k''
| colspan="6" align="center"| ''O''
|}
 
Because three zero bytes were added as padding, the last three encoded characters, 'YkO', are dropped.
 
Decoding reverses the process, except that the padding character is 'u':
{| class="wikitable"
| ASCII
| colspan="6" align="center"| /
| colspan="7" align="center"| c
| colspan="6" align="center"| ''u''
| colspan="7" align="center"| ''u''
| colspan="6" align="center"| ''u''
|-
| Base 85 (+33)
| colspan="6" align="center"| 14 (47)
| colspan="7" align="center"| 66 (99)
| colspan="6" align="center"| 84 (117)
| colspan="7" align="center"| 84 (117)
| colspan="6" align="center"| 84 (117)
|-
| 32-bit Value
| colspan="32" align="center"| 771,955,124 = 14×85<sup>4</sup> + 66×85<sup>3</sup> + 84×85<sup>2</sup> + 84×85 + 84
|-
| Bit pattern
||0||0||1||0||1||1||1||0||0||0||0||0||0||0||1||1||0||0||0||1||1||0||0||1||1||0||1||1||0||1||0||0
|-
| ASCII
| colspan="8" align="center"| 46
| colspan="8" align="center"| 3
| colspan="8" align="center"| 25
| colspan="8" align="center"| 180
|-
| Text content
| colspan="8" align="center"| '''.'''
| colspan="8" align="center"| ''[ [[End-of-text character|ETX]] ]''
| colspan="8" align="center"| ''[ EM ]''
| colspan="8" align="center"| ''&#180; ([[Extended ASCII]])''
|}
 
Because three 'u' characters were added as padding, the last three bytes are dropped.
 
The input sentence does not contain 4 consecutive zero bytes, so the example does not show the use of the 'z' abbreviation.
 
===Compatibility===
The Ascii85 encoding is compatible with 7-bit and 8-bit [[MIME]], while having less overhead than [[Base64]].
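The difference in overhead follows from the group sizes: Base64 turns 3 bytes into 4 characters, while Ascii85 turns 4 bytes into 5. A rough check with Python's standard <code>base64</code> module, using arbitrary sample data that contains no all-zero groups (which would shorten the Ascii85 output further):

```python
import base64

data = bytes(range(1, 101)) * 3          # 300 bytes, none of them zero

b64_len = len(base64.b64encode(data))    # 3 bytes -> 4 characters
a85_len = len(base64.a85encode(data))    # 4 bytes -> 5 characters

print(b64_len, a85_len)
```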
 
One potential compatibility issue of Ascii85 is that its output can contain 'single' and "double" quotation marks, <angle> brackets, and ampersands (&), none of which can be used unescaped in markup languages like XML or SGML.
 
==<nowiki>RFC 1924</nowiki> version==
Published on [[April Fools' Day Request for Comments|April 1, 1996]], informational RFC 1924: "A Compact Representation of IPv6 Addresses" by [[Kevin Robert Elz|Robert Elz]] suggests a base-85 encoding of [[IPv6]] addresses. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.
 
The proposed character set is, in order, <code>0</code>–<code>9</code>, <code>A</code>–<code>Z</code>, <code>a</code>–<code>z</code>, and then the 23 characters <code>!#$%&amp;()*+-;<=>?@^_`{|}~</code>. The highest possible representable address, 2<sup>128</sup>−1&nbsp;= 74×85<sup>19</sup>&nbsp;+ 53×85<sup>18</sup>&nbsp;+ 5×85<sup>17</sup>&nbsp;+ ..., would be encoded as <code>=r54lj&amp;NUUO~Hi%c2ym0</code>.
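A sketch of this scheme in Python, treating the whole address as one 128-bit integer. The function names are hypothetical; the standard <code>ipaddress</code> module is used only to parse and format addresses.

```python
import ipaddress

RFC1924_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)

def rfc1924_encode(address: str) -> str:
    """Encode an IPv6 address as a single 20-digit base-85 number."""
    value = int(ipaddress.IPv6Address(address))
    digits = []
    for _ in range(20):                           # 85**20 exceeds 2**128, so 20 digits suffice
        value, d = divmod(value, 85)
        digits.append(RFC1924_ALPHABET[d])
    return "".join(reversed(digits))

def rfc1924_decode(text: str) -> str:
    """Decode a 20-character string back to the usual textual IPv6 form."""
    value = 0
    for ch in text:
        value = value * 85 + RFC1924_ALPHABET.index(ch)
    return str(ipaddress.IPv6Address(value))

print(rfc1924_encode("1080:0:0:0:8:800:200C:417A"))
```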
 
While the RFC chose a different character set in order to prevent the use of certain problematic characters <code>"',./:[\]</code>, it still requires escaping for SGML-based protocols, notably for XML.
 
==See also==