14-Mar-88 03:18:32-EST,1904;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:A-PIRARD@BLIULG11.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 14 Mar 88 03:18:18-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 14 Mar 88 03:17:59 EST
Received: from VM1.ULG.AC.BE by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7381; Mon, 14 Mar 88 03:17:58 EST
Received: by BLIULG11 (Mailer X1.25) id 7697; Mon, 14 Mar 88 09:15:55 +0100
Date: Mon, 14 Mar 1988 08:45:21 +0100
From: Andre PIRARD
Subject: ASCII, ISO and which EBCDIC?
To: Info-IBMPC Digest c/o Gregory Hicks COMFLEACTS , IBM-KERMIT@CU20B.COLUMBIA.EDU, Protocol Converter list , LINKFAIL@FRULM11, Columbia University Center for Computing Activities

We, ASCII or EBCDIC network users, must pay particular attention to character code standards, now extending to international characters. Even sites not interested in international characters will sooner or later hit the problem because, although the situation is straightforward in the ASCII world with an ISO standard, it is far from that for EBCDIC users, who face a choice of several codes whose differences lie in a few code positions, strangely enough not the international ones.

The subject is discussed on a mailing list set up by Edwin Hart. Joining with:

TELL LISTSERV AT JHUVM SUB ISO8859 user-name

Or sending a note on BITNET to:

LISTSERV AT JHUVM

Containing:

SUB ISO8859 user-name

can help the community agree on a viable single code or at least help each site in finding its most appropriate one and save everybody's time and money.

I'll soon post a summary of the problem to that list. Please forward this note to anybody interested.

22-Mar-88 13:31:54-EST,21373;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 22 Mar 88 13:31:43-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 22 Mar 88 13:32:04 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0746; Tue, 22 Mar 88 13:32:01 EDT
Received: by BITNIC (Mailer X1.24) id 0743; Tue, 22 Mar 88 13:21:56 EDT
Date: Tue, 15 Mar 88 11:17:07 EST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: Some Important Comments from Howard Gilbert at Yale University
To: Frank da Cruz

Enclosed are some comments Howard Gilbert wrote after the SHARE 69 meeting held in August, 1987. It is a good summary. Unless otherwise stated, "EBCDIC" means "U.S./Canada English EBCDIC".

IBM is very interested in the issues surrounding ASCII and EBCDIC character sets, particularly as they relate to National Character Set Issues (see the SHARE European Association (SEAS) "White Paper on national character, language and keyboard problems", September, 1985) and the Systems Application Architecture.

The Michigan Terminal System (MTS) community has implemented an ISO 8859-1 to EBCDIC Code Page 37, Version 1 conversion already. (Brian Eliot)

IBM also is trying to decide between two code pages to use as a single EBCDIC for ISO Latin Alphabet number 1: Code Page 37, version 1 (which is data processing oriented) and Code Page 500, version 1 (which is word processing oriented).

Ed Hart

Date: Thu, 03 Sep 87 11:45:39 EST
From: Howard Gilbert
To: HART@APLVM

I think that what we learned at SHARE is important to a lot of people.
Besides our local users, it should be directed to:

BITNET technical reps (for ASCII BITNET nodes)
NOTIS groups (library automation)
7171 protocol converter list

Here is a first draft of the kind of thing I am thinking of sending out:

I attended the meetings of the ASCII-EBCDIC Translate Table committee at SHARE 69 (Chicago) Aug. 23-28. As someone who has struggled with this question for 15 years, I was stunned by the amount of progress which has been made almost overnight. There is no solution which will be satisfactory to everyone, but a solution is now visible and requires only the willingness to adopt it.

Because of the problem, I will discuss all characters by name or code value and not attempt to print them. I will try to make the presentation as short as possible while making it complete.

HISTORY:

ASCII was standardized after EBCDIC. With no national standard in place, and requirements for BCD compatibility, IBM extended its 6 bit character set BCD to an 8 bit code by moving bits rather than translating with a table. Both standards made what are, in hindsight, some regrettable mistakes. The ASCII committee placed too much emphasis on the TTY 33 which had no lowercase letters and lacked braces and vertical bar. The choice of EBCDIC printable characters was influenced by the number of characters which could be placed on a Selectric typewriter ball.

The ANS committee (not IBM) recommended that ASCII exclamation point be regarded as EBCDIC vertical bar and that ASCII Circumflex be treated as EBCDIC not. These were "stylistic differences" in the 1968 standard. EBCDIC has a cent sign which ASCII did not match, and ASCII has brackets, braces, backslash, accent, and tilde which EBCDIC did not originally position in its tables. There were some IBM print bands (notably TN and ALA) which included some other characters, but they do not constitute a standard nor are they the basis for one.

Both ASCII and EBCDIC have "national use" characters which can be replaced in other countries by local graphics. National standards bodies in most European countries have chosen specific graphics for the ASCII positions, and IBM has copied these choices to the EBCDIC positions. Technically "ASCII" refers to the USA version, but everyone uses the term to refer to all the ISO 646 standard character sets which are similar except for national use positions.

IBM does not change the compiler. Therefore, a C program will print rather strangely in Germany where the German standard replaces backslash with O-umlaut. Of course, the programmers in most countries continue to order terminals and printers with the US character set or to make that set available as a special printer setup. It is interesting to note that PASCAL was, after all, originally developed in Switzerland where the official national languages are French, German, and Italian.

At the time that Yale ASCII was shipped, IBM had no strong definition as to the position of backslash, braces, and brackets in the EBCDIC set. Some mistakes were made (in hindsight) which were then perpetuated in the 7171. However, the translate tables are installation configuration options.

One approach to extending the character set is to form diacritically marked characters using true overstrikes. In other words, there is a key marked with an umlaut. Press the key and an umlaut appears but the cursor does not advance. Press "O" and what displays is O-umlaut and the cursor advances. This is the model used by the CCITT for European telex and by MARC and library systems (such as the ALA character feature on the IBM 316x). It is also effectively used by the NOTIS and DOBIS library systems. It has never been accepted by data processing equipment manufacturers, who have instead pressed for an entirely different form of character set extension based on one character per code position (i.e., each accented character occupies one code position).

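The two models above (overstrike versus one code position per accented character) have a loose modern analogue in Unicode combining marks and precomposed characters. The following Python sketch is an analogy only: Unicode postdates this discussion and is not part of the original message, and Unicode places the mark after the base letter whereas the telex convention described above types the mark first. The variable names are illustrative.

```python
import unicodedata

# Overstrike-style representation: a base letter plus a separate diaeresis mark.
# (Assumption for illustration: Unicode combining sequence, mark AFTER the base.)
overstruck = "O\u0308"

# One-code-position model: compose the pair into a single character.
precomposed = unicodedata.normalize("NFC", overstruck)

print(len(overstruck), len(precomposed))      # two code points versus one
print(precomposed.encode("latin-1").hex())    # single ISO 8859-1 byte for O-umlaut
```

The composed form occupies exactly one position in ISO 8859-1, which is the property the data processing manufacturers pressed for.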
TECHNOLOGY TO THE RESCUE.

There would probably never have been a solution to this problem without the elimination of some constraints. It may be that there are many devices today which will not be able to use the solution, but this is a long term problem which will be solved over years as old devices disappear. The important change is that microprocessor technology in terminals and non-impact page printers both make it possible to extend the generally available number of characters from the old 96 to 194.

ISO (the International Organization for Standardization) has adopted ISO 8859/1 and ANSI (the US standards organization) is adopting the same standard under title ANSI X3.134.2. It provides a specific standard set of graphics for 8 bit ASCII code points X'A0' to X'FF'. Of particular significance is the assignment of cent sign to X'A2' and EBCDIC-not to X'AC'. So suddenly ASCII has all of the true characters typically found on an IBM terminal.

Of course, IBM always had an 8 bit character set with many unused positions. However, there are only so many keys on the keyboard or positions on the print band. The PC and 3800 allow all of the possible positions to be filled in.

Since IBM does pay careful attention to standards, the ISO development made it possible for them to create an internal standard for EBCDIC placement of the same 194 graphic symbols in the range of X'41' to X'FE'. IBM started with the 38xx page printer USA DP code assignments (see "Code Page T1GDP037" in IBM 3800 Printing Subsystem Model 3 Font Catalog SH35-0053 which I assume everyone has in his library) and then adjusted it with:

middle dot at X'B4'
copyright at X'B5'
times/multiply at X'BF'
special hyphen at X'CA'
superscript 1 at X'DA' (replaces Turkish dotless small i)
divide at X'E1'

This will be referred to as the "code page 37" table. (In addition, IBM adjusted each country's EBCDIC to include the extra characters by filling the empty slots in the tables.
(These are the Country Extended Code Pages, CECPs.) Note that while all the characters exist, code positions vary for each country's individual CECP.)

The result is an implied 1-1 correspondence of 194 ASCII and US EBCDIC printable characters which in turn implies a translate table. All of this is possible because the technology has moved to microprocessors on most terminals and printers and expanded memory allows extended character sets.

MOST PEOPLE DO NOT UNDERSTAND WHAT THIS REALLY MEANS.

A code set assigns a graphic representation to a byte value. The "graphic representation" is most important when a file is printed or displayed on a terminal. The value X'4E' can be stored in binary in a FORTRAN program and can be copied from one variable to another without anyone caring what it means. Only when it is displayed do we determine if it should be "N" (ASCII) or "+" (EBCDIC).

Now compilers do care about the difference between "N" and "+". Curiously enough, most ANSI language standards do not specify code points. FORTRAN, COBOL, PASCAL, PL/I, and C all specify that "+" means addition. But none of the standards requires that plus have any particular binary value. IBM mainframes use EBCDIC "+" and most other computers use ASCII "+", but some systems place it at another location. VSAPL, for example, has an internal code set called "Z code" which rearranges characters for easier interpretation. Even then, the code which displays as plus still means addition.

The problem is that graphic representation is a matter of taste. A certain amount of flexibility has to be left in to allow for italics, to let zero optionally have a bar through it or not, and to let the Europeans put a bar through their Z (pronounced "zed" over there). The standards allow "stylistic differences". In its most extreme case, however, ASCII exclamation point was regarded as a stylistic representation of EBCDIC vertical bar! (or should I say |) in ANSI X3.4-1966.

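The X'4E' example above can be tried directly. This is a minimal sketch, assuming Python's built-in `cp037` codec is an acceptable stand-in for the "code page 37" table discussed here and `latin-1` for ISO 8859/1; the variable names are illustrative and not from the original message.

```python
# The same stored byte, interpreted under two different code sets.
byte = bytes([0x4E])

as_ascii = byte.decode("latin-1")   # ISO 8859-1, a superset of ASCII
as_ebcdic = byte.decode("cp037")    # Python's codec for EBCDIC code page 37

print(repr(as_ascii), repr(as_ebcdic))

# Translating between code sets changes the byte value, not the graphic:
plus_in_ebcdic = "+".encode("cp037")
plus_in_ascii = "+".encode("latin-1")
print(plus_in_ebcdic.hex(), plus_in_ascii.hex())
```

The byte itself carries no meaning until a code set is chosen for display, which is the point being made about compilers and language standards.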
These stylistic differences start to become a problem when they affect the selection of codes accepted by compilers, command processors, and other non-printer system components. They have then been allowed to gum up the translation between code sets.

The naive user says that the standard EBCDIC code for "A" is "C1". A more correct statement is that the standard graphic representation for X'C1' is "A". Other graphic representations exist for the code (look at the 3800 fonts manual for symbol fonts and consider Japanese and other languages). The point, however, is that most of the compilers, editors, and systems regard X'C1' as a letter. Actually no compiler cares what the human thinks that the letter is. Alright, there is a funny thing in FORTRAN that I-N are integers by default, but by and large the 26 letters of the alphabet are interchangeable in forming names. Thus in some other country these letters could be replaced by the local alphabet as long as letters remain letters and punctuation remains punctuation.

WARNINGS:

So suddenly we have an "official" translate table between U.S. EBCDIC and ASCII. To do it, we had to go to a larger character set on both sides. In doing so we pick up the most important foreign language characters (as determined by ISO, not IBM). This can be supported on all laser printers, PCs, and character loadable devices. It may not work on older printers and terminals. DEC, for example, supports a subset of the ISO standard as its extended character set on its terminals. However, it is generally possible today to load fonts into all but the very ancient equipment.

An IBM standard is an internal document. Its existence will force subsequent product developers to justify deviation from the standard, but will not prevent such deviations when a business case exists. Put another way, if IBM feels that there is still a market for band printers, technology will prevent the creation of a 194 character set for such a device.
Given a smaller character set, IBM may have to deviate from the larger standard. However, when an organization has to make code translations, this new standard becomes the obvious starting point.

There is no evidence that the compilers and other applications will be ready to deal with these EBCDIC assignments. In particular, for C and PASCAL which are defined for ASCII, the compilers must support:

circumflex at X'B0'
left bracket at X'BA'
right bracket at X'BB'

The compilers and other applications must recognize dual EBCDIC codes for some characters. Specific examples are:

C:
circumflex and not for "negation"
vertical bar and split bar for "or"
brackets at BA/BB and AD/BD code points
braces at 8B/9B and C0/D0 code points
"*" and new "x" for multiplication
"/" and new divide for division

PASCAL:
circumflex and not and tilde for "negation"
vertical bar and split bar for "or"
brackets at BA/BB and AD/BD code points
braces at 8B/9B and C0/D0 code points
"*" and new "x" for multiplication
"/" and new divide for division

PL/I:
circumflex and not and tilde for "negation"
vertical bar and split bar for "or"
"*" and new "x" for multiplication
"/" and new divide for division

REXX:
circumflex and not and tilde for "negation"
vertical bar and split bar for "or"

Query Languages: Unknown.

TELNET: (In DoD TCP/IP network) virtual terminal protocol must allow the installation to define the character to use for the CONTROL shift. Ideally, the installation would be able to define two code positions (e.g., cent for U.S. EBCDIC 3270s and left bracket for ASCII-7 character compatibility). (You want a character that you seldom use. ASCII-7 terminals have no cent and EBCDIC-94 has no brackets).

There is no standard for the translation of control characters. There are 65 ASCII control codes (X'00'-X'1F' and X'7F'-X'9F') and exactly the same number in EBCDIC (X'00'-X'3F' and X'FF'). However, there is no official 1-1 translation.
In the past there was a tendency for duplicated mappings (EBCDIC LF and NL were both commonly mapped to ASCII LF) so making changes will not be a trivial decision.

ISSUES

It is always difficult to know what the implications of a translate table change are going to be. It is necessary to try it and then see what happens. There are some old devices, like the 6670, for which change is extremely complex. Fortunately, the desktop publishing revolution and Postscript printers are making such old devices less important.

At Yale, we have no special insight into the software. We will have to determine the impact of this new character mapping on PASCAL, C, PL/I, WSCRIPT, DCF, and other character sensitive products. However, we have unusual control over the communications area. Through YTERM 1.4 it is possible to define any PC to be an ISO 8859 terminal. It is also possible to load the EGA with a font that corresponds to ISO 8859 characters (eliminating a translation into the standard monochrome extended character set). By changing the translate tables in the Series/1, it is possible to build a pseudo-3270 display which supports all 194 new EBCDIC code points and displays them on an ISO 8859 terminal (like YTERM).

In the near term, there will be some problems. Older ASCII-7 terminals will not support the ISO standard and will require the old translate tables to code PL/I. Some host compilers will not support the new ASCII positions and uploaded files may require a translation pass until the compilers are upgraded. Therefore, these changes would be installed only for experimental access until the full impact is determined.

In the long run, however, these tables provide an interesting recommendation for BITNET file transfer. If this translation could be adopted (with specific control code mappings) by ASCII locations on BITNET then we could address a number of file interchange problems.
However, even though the ISO 8859 is 95% identical to the DEC VAX extended character set, there will still have to be a comparable period of testing on the VMS and UNIX side to determine if the translation poses a problem on that end.

The ISO 8859 standard has several parts. I have been talking about 8859/1, the standard for Roman character sets. There are other parts for Eastern European and presumably Russian, Arabic, Hebrew, and Japanese. Eventually these issues may also have to be addressed.

IMPACT

There are several areas of Yale activity which could be affected by this standard:

The Computer Center would be directly affected if the standard is to be supported for general terminal access. For communications and terminal support, this involves creation of YTERM tables, changes to the frontend processors, and changes to PCTRANS and possibly TPRINT. At this time there would be no intention to change line-at-a-time communications support or the Datasouth printers since these devices do not support extended character sets. This is a subject for discussion and a long range objective. We also need to study the impact on the existing IBM compilers, REXX, and other applications.

Yale will work through BITNET to get these standards adopted throughout the community. This will require the agreement of the rest of the university community.

The university library community and the NOTIS package might consider the implications of this standard. The current approach based on the special ALA support for overstruck characters is available on a limited set of devices. This is an area of future discussion.

The general community of word processing programs and users at Yale should take these code assignments into consideration when building fonts. This is a user activity and does not explicitly involve Computer Center personnel.

University users who have in the past been interested in non-Roman character sets should investigate the implications of the other elements of the ISO 8859 standard.

Unfortunately, Yale has no particular forum in which to discuss such changes. I would like to receive comments at GILBERT@YALEVM and will attempt to call a meeting to discuss the implications and implementation of table changes if the response warrants it.

HERE IT IS, WITH WARNINGS.

The following tables are presented for the purpose of discussion. They have not been checked for accuracy and are subject to amendment. It is, for example, rather difficult for an American to distinguish lowercase from uppercase "Icelandic Thorn" especially in two entirely different type settings. Still, the only way to document things is to actually provide the tables. A curse upon anyone who actually puts them into production before the community as a whole agrees to them.

ASCII TO EBCDIC TABLE (WITHOUT CONTROL CODES)

*  0 1 2 3  4 5 6 7  8 9 A B  C D E F
0  00?????? ???????? ???????? ????????  0
1  ???????? ???????? ???????? ????????  1
2  405A7F7B 5B6C507D 4D5D5C4E 6B604B61  2
3  F0F1F2F3 F4F5F6F7 F8F97A5E 4C7E6E6F  3
4  7CC1C2C3 C4C5C6C7 C8C9D1D2 D3D4D5D6  4
5  D7D8D9E2 E3E4E5E6 E7E8E9BA E0BBB06D  5
6  79818283 84858687 88899192 93949596  6
7  979899A2 A3A4A5A6 A7A8A9C0 4FD0A1FF  7
8  ???????? ???????? ???????? ????????  8
9  ???????? ???????? ???????? ????????  9
A  41AA4AB1 9FB26AB5 BDB49A8A 5FCAAFBC  A
B  908FEAFA BEA0B6B3 9DDA9B8B B7B8B9AB  B
C  64656266 63679E68 74717273 78757677  C
D  AC69EDEE EBEFECBF 80FDFEFB FCADAE59  D
E  44454246 43479648 54515253 58555657  E
F  8C49CDCE CBCFCCE1 70DDDEDB DC8D8EDF  F
*  0 1 2 3  4 5 6 7  8 9 A B  C D E F

EBCDIC TO ASCII TABLE

*  0 1 2 3  4 5 6 7  8 9 A B  C D E F
0  00?????? ???????? ???????? ????????  0
1  ???????? ???????? ???????? ????????  1
2  ???????? ???????? ???????? ????????  2
3  ???????? ???????? ???????? ????????  3
4  20A0E2E4 E0E1E3E5 E7F1A22E 3C282B7C  4
5  26E9EAEB E8EDEEEF ECDF2124 2A293BAC  5
6  2D2FC2C4 C0C1C3C5 C7D1A62C 254F3E2F  6
7  F8C9CACB C8CDCECF CC6D3A23 4D273D22  7
8  D8616263 64656667 6869ABBB F0FDFEA1  8
9  B06A6B6C 6D6E6F70 7172AABA E6B8C6A4  9
A  B57E7374 75767778 797AA1BF D0DDDEAE  A
B  5EA3A5B7 A9A7B6BC BDBE5B5D AFA8B4D7  B
C  7B414243 44454647 4849ADF4 F6F2F3F5  C
D  7D4A4B4C 4D4E4F50 5152B9FB FCF9FAFF  D
E  5CF75354 55565758 595AB2D4 D6D2D3D5  E
F  30313233 34353637 3839B3D8 DCD9DA7F  F
*  0 1 2 3  4 5 6 7  8 9 A B  C D E F

22-Mar-88 20:55:10-EST,6359;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 22 Mar 88 20:55:02-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 22 Mar 88 20:55:15 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1512; Tue, 22 Mar 88 20:55:14 EDT
Received: by BITNIC (Mailer X1.24) id 3564; Tue, 22 Mar 88 20:49:19 EDT
Date: Tue, 22 Mar 88 19:10:18 EST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Mike_Alexander@um.cc.umich.edu
Subject: Re: Some Important Comments from Howard Gilbert at Yale University
X-To: ISO8859%JHUVM.BITNET@CUNYVM.CUNY.EDU
To: Frank da Cruz

I read with great interest the comments from Howard Gilbert at Yale regarding ISO8859 to EBCDIC translation. He captures the essence of the problem very well. As Edwin Hart indicated in his preface to these comments, the MTS community has already installed an ISO8859 to EBCDIC translate table as its standard for network access to the various machines involved. I compared the translate table that Howard Gilbert gave at the end of his message with the one we use (ignoring the control characters he didn't fill in) and found the following differences.

ISO8859  Name in ISO8859     Gilbert  MTS
7F       DEL                 FF       07
DE       Capital Thorn       AE       8E
E6       Small ae diphthong  96       9C
FE       Small Thorn         8E       AE

The code for E6 seems to be a typo, since ISO8859 code point 6F also translates into EBCDIC code point 96. The reverse table translates EBCDIC 96 (which is a lower case o) into ISO8859 6F which seems correct.

We chose to translate ISO8859 code 7F into EBCDIC 07 since that has been defined as the DEL character in various IBM publications for some time. I didn't personally have much to do with this decision, so I'll let others justify it, but it seems to make sense.

We seem to disagree about the difference between an upper case and a lower case Icelandic Thorn. I hope we're right, since our table is already installed.

In case anyone is interested, here is the rest of our translate table. The codes for ISO8859 01 through 1F were chosen to correspond to existing EBCDIC control characters. I don't recall all the discussion behind the choice of the codes for ISO8859 80 to 9F, but these codes were chosen so that the entire table is one to one. I can dig up some of the discussion behind these choices if anyone cares. In the following, the name on the left gives the ISO8859 code and the value in quotes is the corresponding EBCDIC code.

ITOE#01  DC  X'01'  SOH start of heading (Ctrl-A)
ITOE#02  DC  X'02'  STX start of text (Ctrl-B)
ITOE#03  DC  X'03'  ETX end of text (Ctrl-C)
ITOE#04  DC  X'37'  EOT end of transmission (Ctrl-D)
ITOE#05  DC  X'2D'  ENQ enquiry (Ctrl-E)
ITOE#06  DC  X'2E'  ACK acknowledge (Ctrl-F)
ITOE#07  DC  X'2F'  BEL bell (Ctrl-G)
ITOE#08  DC  X'16'  BS backspace (Ctrl-H)
ITOE#09  DC  X'05'  HT horizontal tabulation (Ctrl-I)
ITOE#0A  DC  X'25'  LF line feed (Ctrl-J)
ITOE#0B  DC  X'0B'  VT vertical tabulation (Ctrl-K)
ITOE#0C  DC  X'0C'  FF form feed (Ctrl-L)
ITOE#0D  DC  X'0D'  CR carriage return (Ctrl-M)
ITOE#0E  DC  X'0E'  SO shift-out (Ctrl-N)
ITOE#0F  DC  X'0F'  SI shift-in (Ctrl-O)
*
ITOE#10  DC  X'10'  DLE data link escape (Ctrl-P)
ITOE#11  DC  X'11'  DC1 device control 1 (X-On, Ctrl-Q)
ITOE#12  DC  X'12'  DC2 device control 2 (Ctrl-R)
ITOE#13  DC  X'13'  DC3 device control 3 (X-Off, Ctrl-S)
ITOE#14  DC  X'3C'  DC4 device control 4 (Ctrl-T)
ITOE#15  DC  X'3D'  NAK negative acknowledge (Ctrl-U)
ITOE#16  DC  X'32'  SYN synchronous idle (Ctrl-V)
ITOE#17  DC  X'26'  ETB end of transmission block (Ctrl-W)
ITOE#18  DC  X'18'  CAN cancel (Ctrl-X)
ITOE#19  DC  X'19'  EM end of medium (Ctrl-Y)
ITOE#1A  DC  X'3F'  SUB substitute character (Ctrl-Z)
ITOE#1B  DC  X'27'  ESC escape (Escape)
ITOE#1C  DC  X'1C'  FS file separator
ITOE#1D  DC  X'1D'  GS group separator
ITOE#1E  DC  X'1E'  RS record separator
ITOE#1F  DC  X'1F'  US unit separator
ITOE#80  DC  X'20'  ...
ITOE#81  DC  X'21'  ...
ITOE#82  DC  X'22'  ...
ITOE#83  DC  X'23'  ...
ITOE#84  DC  X'24'  ...
ITOE#85  DC  X'15'  ...
ITOE#86  DC  X'06'  ...
ITOE#87  DC  X'17'  ...
ITOE#88  DC  X'28'  ...
ITOE#89  DC  X'29'  ...
ITOE#8A  DC  X'2A'  ...
ITOE#8B  DC  X'2B'  ...
ITOE#8C  DC  X'2C'  ...
ITOE#8D  DC  X'09'  ...
ITOE#8E  DC  X'0A'  ...
ITOE#8F  DC  X'1B'  ...
*
ITOE#90  DC  X'30'  ...
ITOE#91  DC  X'31'  ...
ITOE#92  DC  X'1A'  ...
ITOE#93  DC  X'33'  ...
ITOE#94  DC  X'34'  ...
ITOE#95  DC  X'35'  ...
ITOE#96  DC  X'36'  ...
ITOE#97  DC  X'08'  ...
ITOE#98  DC  X'38'  ...
ITOE#99  DC  X'39'  ...
ITOE#9A  DC  X'3A'  ...
ITOE#9B  DC  X'3B'  ...
ITOE#9C  DC  X'04'  ...
ITOE#9D  DC  X'14'  ...
ITOE#9E  DC  X'3E'  ...
ITOE#9F  DC  X'FF'  ...

23-Mar-88 05:07:01-EST,9045;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:A-PIRARD@BLIULG11.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 05:06:36-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 23 Mar 88 05:06:42 EDT
Received: from VM1.ULG.AC.BE by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2089; Wed, 23 Mar 88 05:06:40 EDT
Received: by BLIULG11 (Mailer X1.25) id 6083; Wed, 23 Mar 88 11:03:41 +0100
Date: Wed, 23 Mar 88 11:00:18 +0100
From: Andre PIRARD
Subject: ASCII/ISO/which EBCDIC? summary
To: ISO8859@JHUVM, Protocol Converter list , Columbia University Center for Computing Activities , IBM-KERMIT@CU20B.COLUMBIA.EDU

Some time ago, I raised a discussion on several mailing lists about data communication and ASCII/ISO/EBCDIC character codes. I now realize my wording was very loose. Since then, I have had contacts with both kind people on the nets and a very knowledgeable IBM representative. I feel responsible for restating the problem correctly to avoid confusion and to reflect the information, as I promised to some. I'll try to be as short as feasible. Please join Edwin Hart's list ISO8859 at JHUVM for discussing details on codes etc...

We, ASCII or EBCDIC network users, must pay particular attention to character code standards, now extending to international characters. Even sites not interested in international characters will sooner or later hit the problem because, although the situation is well defined in the ASCII world with an (often overlooked) ISO standard, it is far from that for EBCDIC users, who are faced with a choice among several new "code pages" whose differences lie in the positions of a few characters, strangely enough not the extended ones.

The era of data communication raises an urgent need for a single character code standard. BITNET apparently had found one. It is now silently tossed up by these new code sets.
We were offered "table 500" (below) without warning. And it turns out that our IBM representatives ignored the de-facto coherence of BITNET.

ISO has done considerable work in defining the graphics necessary for each country and assigning them codes. For Latin-based alphabets, this yielded ISO 8859/1 = ANSI X3.134.2 = ECMA 94, which is wisely a superset of ISO 646 = ANSI X3.4, the well-known ASCII. ISO 8859/1 assigns character graphics to the A0-FF code range. The range 80-9F is unassigned and can be used for special purposes in 8-bit storage and transmission. But it is kept free in order not to interfere with control codes 00-1F during 7-bit transmission in compatibility with ISO 2022, alternating between the two sets with the SI/SO control codes. Nobody questions the value of ISO and everything so far looks ideal to avoid a new Babel for the largest part of the world.

IBM, in conforming EBCDIC to ISO, at least strongly claims that any EBCDIC extension shall contain exactly the ISO character set, in order to make a reversible translation always possible, but allows variations in which particular code is assigned to an ISO character. This idea is also the origin of the IBM PC code page 850 ASCII extension and of the IBM mainframes' multiple CECPs (country extended code pages) EBCDIC extensions.

Why multiple? Because:

- Compatibility with previous codes rules IBM evolution, e.g. code page 850 contains the ISO characters, but most of the former cp 437 stay in place (missing ones expel graphic characters).
- The eighty-some-characters restricted former EBCDIC did not contain all the X3.4 ASCII characters and conversely (see IBM publication GX20-1850, the yellow book, pp 9-12 second column, let's call it simply "EBCDIC" and the third column "TN-chain").
- Some of those EBCDIC codes not in ASCII are vital for programming or using IBM systems and had to be produced from ASCII terminals.
- ASCII/EBCDIC translation tables were built to accommodate these needs instead of mapping equivalent characters, varied over time and systems, and are different from those used in file transfers.
- Habits, software and data built up to a huge amount.
- ISO now defines the missing EBCDIC characters.
- It is finally embarrassing to define a single extended EBCDIC and the proposed extensions tend to match the terminal tables rather than the more stable file transfer ones.

Never mind, says IBM. As long as a particular EBCDIC extension conforms to ISO, GDDM will take care of that. And we're off on the grounds that any conforming extension will do.

These extensions are called "Code Pages XXXXX" (cpXXX for short). The most prominent offerings are cp500 and cp037, more on them below, but others exist in order to best fit existing installation use.

GDDM is an IBM product that will interface with the operating system, the I/O devices and the application programs in order to (for our concern) convert one particular code page to another. They say GDDM will use cp500 internally as the code page to and from which conversion will be made. I simply don't believe in (that function of) GDDM because it can only be effective when everything has been converted to that interface. Networking is a crying example. What could GDDM do to a file (they're supposed to be code-tagged) received from a network site that does not use it?

My opinion is that we have to settle on a single code NOW because the sooner the better, at least for networking, but also the recommended one. Which one? Practically, the one making the most people happy, certainly. And BITNET users are numerous. Other reasons favour the present code:

- It must be compatible with former EBCDIC.
- The compatibility with the former ASCII/EBCDIC translation is vital, because it often gets involved in conversions whose result is used as data critical for computation rather than "good looking" humanly readable text.
BTW, I think that storing ASCII data on BITNET servers is best done in "binary" format (ASCII file streams split into "records" of arbitrary length, best 128). Too bad for direct EBCDIC-wise readability of docs.

cp500 is simply not compatible with the former EBCDIC: it carries on a strange habit of using exclamation marks for what a compiler understands as a vertical bar and such things. I am told it is recommended to Europeans because GDDM uses it internally (???) and on the ground of compatibility with previous codes, but it does not preserve their accented letters :-)

cp037 is EBCDIC compatible and recommended to US and Canadian users.

Neither is compatible with what I believe is the prominent ASCII/EBCDIC translation, that of the 7171, VM, Kermit, BITNET gateways, ASCII tape conversion etc... and, as I am told by IBM, the 3708 and 3275.

- cp037 puts brackets at BA and BB and cp500 puts them at 4A and 5A whereas traditional conversion from ASCII is to the positions in the TN chain AD and BD.
- cp500 additionally deviates, because of its EBCDIC discrepancies, for ASCII "exclamation mark" and "vertical bar".
- the ASCII "circumflex" was uniformly translated to EBCDIC "not sign" 5F. There was no circumflex in EBCDIC, but its new ISO-based definition threatens the former conversion.
- whereas the ASCII backslash is often used to give the cent sign in terminal mode, file transfers keep the EBCDIC backslash.

cp037 and cp500 differ in only 7 characters.

VM/SP 5 uses two TTY conversions: TERMINAL ASCIITBL VM1 or VM2. VM1, the default, is "traditional" (037 with TN chain brackets) and matches no code page. VM2 corresponds to cp500, but the brochure GC24-5328 explains that by using the 037 graphics. To add to the confusion the explanation refers to ANSI X3.4 and X3.26 respectively.

My experience shows that BITNET is working perfectly as it stands. Are we going to let a change mess up all that?
It also looks as though getting IBM to define another code page would not be hard, and that there is "nothing defined yet as communication standard". I think we should strongly consider requesting another code page that matches BITNET, and that it should become the standard.

In summary: adopting CP037 with brackets at AD BD is easy. What I find more serious is the "ASCII circumflex" to "EBCDIC not" conversion, which makes no theoretical sense now that both characters are defined in the other set, but is presently used as such in many stored character-encoded binary files.

I am closing this discussion on these lists; it now belongs on the ISO8859 list.

23-Mar-88 06:36:54-EST,1210;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:A-PIRARD@BLIULG11.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 06:36:37-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 23 Mar 88 06:36:37 EDT
Received: from VM1.ULG.AC.BE by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2224; Wed, 23 Mar 88 06:36:36 EDT
Received: by BLIULG11 (Mailer X1.25) id 7865; Wed, 23 Mar 88 12:35:32 +0100
Date: Wed, 23 Mar 88 12:21:37 +0100
From: Andre PIRARD
Subject: Re: Non-standard EBCDIC mappings
To: IBM-KERMIT@CU20B.COLUMBIA.EDU
In-Reply-To: Message of 1988 Mar 14 23:41 EST from

Had the situation been well defined, I would have suggested implementing the full ISO character set translation in the optional 8-bit table. But with various EBCDIC versions around, and pure ISO itself being rarely used, even on the IBM PC, I think it is best to wait and see. The present IBM Kermit translation table is probably what everyone silently wishes for as "the" standard EBCDIC. Let us refrain from encouraging exotic ones and leave the door open for compatible extension.
23-Mar-88 15:05:14-EST,2877;000000000001
Return-Path: <@um.cc.umich.edu:Bruce_Jolliffe@mtsg.ubc.ca>
Received: from umix.cc.umich.edu by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 15:04:53-EST
Received: by umix.cc.umich.edu (5.54/umix-2.0) id AA20587; Wed, 23 Mar 88 15:08:57 EST
Received: from MTSG.UBC.CA by um.cc.umich.edu via MTS-Net; Wed, 23 Mar 88 14:54:46 EST
Date: Wed, 23 Mar 88 11:53:14 PST
From: Bruce_Jolliffe@mtsg.ubc.ca
To: IBM-Kermit@cu20b.Columbia.edu, info-kermit@cu20b.Columbia.edu, iso8859%jhuvm@umix.cc.umich.edu, ibm7171%dearn@umix.cc.umich.edu
Message-Id: <972890@mtsg.ubc.ca>
Subject: ISO (ASCII) to EBCDIC Standards

As one of several MTS sites that have recently adopted an ISO 8859 - Code Page 37 translation table, I found your note on the adoption of standard ASCII-EBCDIC tables interesting.

We mapped each ISO graphic to its corresponding EBCDIC graphic. Thus we mapped the EBCDIC logical not (5F) into the ISO logical not (AC). Similarly we mapped the ISO circumflex into the EBCDIC circumflex (B0) and the ISO tilde (7E) into the EBCDIC tilde (A1).

As you might guess, the two thorniest issues over the IBM Code Page 37 were the square brackets and the logical not. As previously noted in another message, the square brackets in Code Page 37 are moved from their traditional TN positions of AD and BD to BA and BB respectively.

The second issue concerned the logical not. At most of the MTS sites we had traditionally mapped EBCDIC logical nots into tildes. After much debate we decided it made no sense to do cross-graphic mapping and decided to go with a graphic-to-graphic mapping. Many of the MTS sites provide general access to their IBM mainframes exclusively through ASCII terminals. Thus many applications that used the logical not as an input character had to be changed to accept the EBCDIC tilde (we had previously mapped EBCDIC logical nots to ASCII tildes).
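A graphic-to-graphic mapping of this kind can be sketched in Python by pairing each ISO 8859-1 code with the EBCDIC code of the same character. The sketch assumes that Python's `latin-1` and `cp037` codecs stand in for the ISO 8859-1 and Code Page 37 tables mentioned above; the MTS tables themselves are not reproduced here, and the names `iso_to_ebcdic` and `show` are hypothetical.

```python
# Sketch only: Python's latin-1 and cp037 codecs are assumed to stand in
# for the ISO 8859-1 and Code Page 37 tables; this is not the MTS code.

# 256-entry table: index = ISO 8859-1 code, value = EBCDIC code of the
# same graphic (graphic-to-graphic, no cross-graphic substitutions).
iso_to_ebcdic = bytes(range(256)).decode("latin-1").encode("cp037")

def show(name: str, char: str) -> None:
    """Print the ISO and EBCDIC codes of one character."""
    iso = char.encode("latin-1")[0]
    ebc = iso_to_ebcdic[iso]
    print(f"{name:12} ISO {iso:02X} -> EBCDIC {ebc:02X}")

show("logical not", "\u00ac")
show("circumflex", "^")
show("tilde", "~")

# Translating a byte string with the table:
print("a^b~c".encode("latin-1").translate(iso_to_ebcdic).hex().upper())
```

Under a cross-graphic scheme, by contrast, the table entry for one graphic would hold the code of a different graphic, which is what the sites above chose to stop doing.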
Prior to the conversion there was a lot of apprehension about changing to the newer standard and we prepared for the worst. Now that the conversion has been done, we can look back and say that it was more of a nuisance than a major hassle. Granted, it was not free, but with a reasonable amount of preparation and saturation publicity the conversion can be relatively painless.

The installations that have made this change include the University of Michigan, Rensselaer Polytechnic Institute, University of British Columbia, Simon Fraser University, University of Newcastle, Durham University, and Wayne State University. The University of Alberta, the other remaining major MTS site, is due to convert this summer.

Bruce Jolliffe
Computing Centre
University of British Columbia
Bruce_Jolliffe@mtsg.ubc.ca or USERBDJ@UBCMTSG.BITNET

23-Mar-88 16:04:42-EST,1909;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 16:04:37-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 23 Mar 88 16:04:49 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3172; Wed, 23 Mar 88 16:04:45 EDT
Received: by BITNIC (Mailer X1.24) id 2885; Wed, 23 Mar 88 15:58:39 EDT
Date: Wed, 23 Mar 88 15:38:39 +0100
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: "Alain FONTAINE (Postmaster - NAD)"
Subject: Re: Some Important Comments from Howard Gilbert at Yale University
To: Frank da Cruz
In-Reply-To: Message of Tue, 15 Mar 88 11:17:07 EST from

Quite important, indeed... But the tables shown are not correct: it is easy to verify that some values are present twice, and some others not at all. This affects six values in the EBCDIC to ASCII table, and one in the ASCII to EBCDIC table.
The replacement values given here are indeed consistent, but that does not mean that they are the truth.

EBCDIC to ASCII
  '6D' should be translated into '5F' instead of '4F'
  '6F' should be translated into '3F' instead of '2F'
  '79' should be translated into '60' instead of '6D'
  '7C' should be translated into '40' instead of '4D'
  '8F' should be translated into 'B1' instead of 'A1'
  'FB' should be translated into 'DB' instead of 'D8'

ASCII to EBCDIC
  'E6' should be translated into '9C' instead of '96'

Does anybody know better ? /AF

23-Mar-88 16:05:24-EST,2571;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 16:05:20-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 23 Mar 88 16:05:38 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3180; Wed, 23 Mar 88 16:05:34 EDT
Received: by BITNIC (Mailer X1.24) id 2907; Wed, 23 Mar 88 15:59:28 EDT
Date: Wed, 23 Mar 88 09:56:36 CST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Michael Sperberg-McQueen
Subject: Thorn
To: Frank da Cruz

I have not yet had the chance to look all through Howard Gilbert's translate table, but can answer the query about thorn: in the EBCDIC CECP for the US, uppercase thorn is at AE and lowercase thorn is at 8E. Apart from the typography (which I admit is not always real clear for non-readers of Icelandic or Old English), these contextual clues should tip you off: 8C-8E (lowercase) correspond to AC-AE (uppercase), and the IBM identifying code (LT630000 and LT640000) for uppercase letters (LT640000 in this case) is consistently 10000 higher than the code for the corresponding lowercase letters.

The typographic differences (in case anyone has to design a font for these!)
are:

- the lowercase thorn has a descender and an ascender; its bowl rests on the base line. (so it is sometimes simulated on non-Icelandic typewriters by overstriking 'b' and 'p', unless they have serifs, or by overstriking right-bracket and 'o')
- the uppercase thorn is standard upper-case height, has no descender, and its bowl is at mid-letter height, like the bowl on a 'P' that has slipped down a bit.

Speaking of fonts -- I have designed an ISO8859 font for the IBM3163 terminal, using font design software which was unsigned but I believe came from Penn. It's utilitarian, not beautiful, more or less matches the native IBM3163 fonts, and anyone who wants it can have it if they promise to send me any improvements they make. (It can also be downloaded and used as a start on a PC font, since the cell sizes are similar but the base line and line thickness are different.)

Michael Sperberg-McQueen, University of Illinois at Chicago

23-Mar-88 23:10:21-EST,3250;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 23 Mar 88 23:10:00-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 23 Mar 88 22:41:12 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3772; Wed, 23 Mar 88 22:41:10 EDT
Received: by BITNIC (Mailer X1.24) id 5808; Wed, 23 Mar 88 22:35:21 EDT
Date: Wed, 23 Mar 88 11:53:14 PST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Bruce Jolliffe
Subject: ISO (ASCII) to EBCDIC Standards
X-To: IBM-Kermit@cu20b.Columbia.edu, info-kermit@cu20b.Columbia.edu, iso8859@JHUVM, ibm7171@DEARN
To: Frank da Cruz

As one of several MTS sites that have recently adopted an ISO 8859 - Code Page 37 translation table, I found your note on the adoption of standard ASCII-EBCDIC tables interesting.
We mapped each ISO graphic to its corresponding EBCDIC graphic. Thus we mapped the EBCDIC logical not (5F) into the ISO logical not (AC). Similarly we mapped the ISO circumflex into the EBCDIC circumflex (B0) and the ISO tilde (7E) into the EBCDIC tilde (A1).

As you might guess, the two thorniest issues over the IBM Code Page 37 were the square brackets and the logical not. As previously noted in another message, the square brackets in Code Page 37 are moved from their traditional TN positions of AD and BD to BA and BB respectively.

The second issue concerned the logical not. At most of the MTS sites we had traditionally mapped EBCDIC logical nots into tildes. After much debate we decided it made no sense to do cross-graphic mapping and decided to go with a graphic-to-graphic mapping. Many of the MTS sites provide general access to their IBM mainframes exclusively through ASCII terminals. Thus many applications that used the logical not as an input character had to be changed to accept the EBCDIC tilde (we had previously mapped EBCDIC logical nots to ASCII tildes).

Prior to the conversion there was a lot of apprehension about changing to the newer standard and we prepared for the worst. Now that the conversion has been done, we can look back and say that it was more of a nuisance than a major hassle. Granted, it was not free, but with a reasonable amount of preparation and saturation publicity the conversion can be relatively painless.

The installations that have made this change include the University of Michigan, Rensselaer Polytechnic Institute, University of British Columbia, Simon Fraser University, University of Newcastle, Durham University, and Wayne State University. The University of Alberta, the other remaining major MTS site, is due to convert this summer.
Bruce Jolliffe
Computing Centre
University of British Columbia
Bruce_Jolliffe@mtsg.ubc.ca or USERBDJ@UBCMTSG.BITNET

24-Mar-88 09:23:23-EST,2803;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 24 Mar 88 09:23:17-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 24 Mar 88 09:23:41 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4191; Thu, 24 Mar 88 09:23:40 EDT
Received: by BITNIC (Mailer X1.24) id 0618; Thu, 24 Mar 88 09:18:10 EDT
Date: Thu, 24 Mar 88 07:19:59 EST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: John C Klensin
Subject: RE: How to get a copy of ISO8859
To: Frank da Cruz

The way to obtain any ISO standard is to go to one's own national standards body and order it through them. In various countries, national depository libraries have them also and other libraries have at least some on standing subscriptions. The ISO Central Secretariat in Geneva tries to stay out of the bookstore business and I gather will not sell standards to individuals. The national standards bodies act as ISO's sales agents within their own countries. Bo, this means that you have to find ISO8859 in Norway and, for the other readers of this list, similarly.

For readers in the USA, ISO standards are obtained through ANSI. It is best to call their order department at 212/642-4900 and get price and shipping information. If you must write, they are at 1430 Broadway, New York City, NY 10018. Specifying "order department" in the address will save a bit of time.

I don't have the information on enough other countries handy to make it worth listing them.
I recommend that people outside the USA not try to order through ANSI for two reasons - they might refuse to sell them to you, and, since the publications department is a major source of funds for ANSI, their prices for ISO standards are often significantly higher than the prices of many other national bodies (some of which, I gather, give the things away).

Specific warning about ISO 8859: It is not one standard, but a whole family of things, starting with what used to be called "eight-bit ASCII" and is now known as "Latin alphabet-1" (ISO8859/1), and extending into a large variety of things (many still in draft) that cover mixtures of the simple characters the Romans used with a large assortment of specialized graphics, embellished Roman, and the character sets of other languages. Since they are all "part of" 8859, ordering "8859" is likely to get you a lot of documents at a proportionately high price.

24-Mar-88 13:28:38-EST,6620;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:LISTSERV@BITNIC.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 24 Mar 88 13:28:32-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 24 Mar 88 13:28:33 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5734; Thu, 24 Mar 88 13:28:28 EDT
Received: by BITNIC (Mailer X1.24) id 2911; Thu, 24 Mar 88 13:21:03 EDT
Date: Thu, 24 Mar 1988 13:21:01 EDT
Sender: "Revised List Processor (1.5m)"
From: Johan van Wingen
To: Frank da Cruz
Subject: File "MOSGLA XMIT" being sent to you.

*--------------------------------- Cut here ----------------------------------*

Dear list users

ECMA standards, which are generally identical with the corresponding ones from ISO, can be ordered from: ECMA, 114 Rue du Rhone, CH-1204, Geneve, Switzerland.

The following is a rather comprehensive list of everything available.

INTERNATIONAL STANDARDS FOR CHARACTER CODES AND RELATED SUBJECTS

ISO 646-1983     ISO 7-bit coded character set for information interchange
ISO 2022-1986    ISO 7-bit and 8-bit coded character sets -
                 Code extension techniques
ISO 2047-1975    Graphical representations for the control characters of
                 the 7-bit coded character set
ISO 2375-1985    Procedure for the registration of escape sequences
ISO 4873-1985    8-bit code for information interchange -
                 Structure and rules for implementation
ISO 5426-1983    Extension of the Latin alphabet coded character set for
                 bibliographic information interchange
ISO 5428-1984    Greek alphabet coded character set for
                 bibliographic information interchange
ISO 6429 DIS     ISO 7-bit and 8-bit coded character sets -
                 additional control functions for character-imaging devices
ISO 6862 DIS     Mathematical coded character set for
                 bibliographic information interchange
ISO 6937         Coded character sets for text communication
ISO 6937/1-1983  General Introduction
ISO 6937/2-1987  Latin alphabetic and non-alphabetic graphic characters
ISO 6937/3 DIS   Control functions for page-image format
ISO 6937/4 DP    Text-processible format
ISO 6937/5 DP    Scientific and technical graphic characters
ISO 6937/6 DP    Publishing and box drawing graphic characters
ISO 6937/7 DIS   Greek graphic characters (to be withdrawn)
ISO 6937/8 DIS   Cyrillic graphic characters (to be withdrawn)
ISO 7350 DIS     Text communication -
                 registration of graphic character subrepertoires
ISO 8859         8-bit single byte coded graphic characters
ISO 8859/1-1987  Latin alphabet no. 1
ISO 8859/2-1987  Latin alphabet no. 2
ISO 8859/3-DIS   Latin alphabet no. 3
ISO 8859/4-DIS   Latin alphabet no. 4
ISO 8859/5-DIS   Latin/Cyrillic alphabet
ISO 8859/6-1987  Latin/Arabic alphabet
ISO 8859/7-1987  Latin/Greek alphabet
ISO 8859/8-DIS   Latin/Hebrew alphabet
ISO 8884 DIS     Keyboard layout for multiple Latin-alphabet languages
ISO 9036-1987    Arabic 7-bit coded character set for information interchange

(DIS : Draft International Standard; DP : Draft Proposal)

Correspondence between ISO and ECMA standards

ISO      ECMA   Registration number of escape sequence (ISO 2375)
8859/1    94    100
8859/2    94    101
8859/3    94    109
8859/4    94    110
8859/5   113    111
8859/6   114    127
8859/7   118    126
8859/8   121    138

National Standards

ANSI X3.04-1977  Code for Information Interchange
GOST 19767-74--GOST 19769-74, GOST 13052-74
                 Main\ v\yislitel'n\e, sistem\ obrabotki i apparatura peredayi dann\h
                 (to be withdrawn, and replaced by a new version)
CAS GB 2312-80   Coded Chinese graphic character set for
                 information interchange
JIS C 6226-1983  Japanese graphic character set for
                 information interchange

Some literature

C. E. Mackenzie, Coded Character sets, History and Development, 1980
Joan M. Smith, Transmitting Text, Ass. for Lit. and Ling. Computing,
    Bulletin, Vol. 11, no. 2, 1983

24-Mar-88 13:55:41-EST,4716;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 24 Mar 88 13:55:35-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 24 Mar 88 13:50:49 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6006; Thu, 24 Mar 88 13:50:47 EDT
Received: by BITNIC (Mailer X1.24) id 3461; Thu, 24 Mar 88 13:43:58 EDT
Date: Thu, 24 Mar 88 11:36:38 EST
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: List of Character Coding Standards
To: Frank da Cruz

Enclosed is a list of character set standards I received from MOSGLA @ HLERUL2.

Dear list users
_________

ECMA standards, which are generally identical with the corresponding ones from ISO, can be ordered from: ECMA, 114 Rue du Rhone, CH-1204, Geneve, Switzerland.

The following is a rather comprehensive list of everything available.
INTERNATIONAL STANDARDS FOR CHARACTER CODES AND RELATED SUBJECTS

ISO 646-1983     ISO 7-bit coded character set for information interchange
ISO 2022-1986    ISO 7-bit and 8-bit coded character sets -
                 Code extension techniques
ISO 2047-1975    Graphical representations for the control characters of
                 the 7-bit coded character set
ISO 2375-1985    Procedure for the registration of escape sequences
ISO 4873-1985    8-bit code for information interchange -
                 Structure and rules for implementation
ISO 5426-1983    Extension of the Latin alphabet coded character set for
                 bibliographic information interchange
ISO 5428-1984    Greek alphabet coded character set for
                 bibliographic information interchange
ISO 6429 DIS     ISO 7-bit and 8-bit coded character sets -
                 additional control functions for character-imaging devices
ISO 6862 DIS     Mathematical coded character set for
                 bibliographic information interchange
ISO 6937         Coded character sets for text communication
ISO 6937/1-1983  General Introduction
ISO 6937/2-1987  Latin alphabetic and non-alphabetic graphic characters
ISO 6937/3 DIS   Control functions for page-image format
ISO 6937/4 DP    Text-processible format
ISO 6937/5 DP    Scientific and technical graphic characters
ISO 6937/6 DP    Publishing and box drawing graphic characters
ISO 6937/7 DIS   Greek graphic characters (to be withdrawn)
ISO 6937/8 DIS   Cyrillic graphic characters (to be withdrawn)
ISO 7350 DIS     Text communication -
                 registration of graphic character subrepertoires
ISO 8859         8-bit single byte coded graphic characters
ISO 8859/1-1987  Latin alphabet no. 1
ISO 8859/2-1987  Latin alphabet no. 2
ISO 8859/3-DIS   Latin alphabet no. 3
ISO 8859/4-DIS   Latin alphabet no. 4
ISO 8859/5-DIS   Latin/Cyrillic alphabet
ISO 8859/6-1987  Latin/Arabic alphabet
ISO 8859/7-1987  Latin/Greek alphabet
ISO 8859/8-DIS   Latin/Hebrew alphabet
ISO 8884 DIS     Keyboard layout for multiple Latin-alphabet languages
ISO 9036-1987    Arabic 7-bit coded character set for information interchange

(DIS : Draft International Standard; DP : Draft Proposal)

Correspondence between ISO and ECMA standards

ISO      ECMA   Registration number of escape sequence (ISO 2375)
8859/1    94    100
8859/2    94    101
8859/3    94    109
8859/4    94    110
8859/5   113    111
8859/6   114    127
8859/7   118    126
8859/8   121    138

National Standards

ANSI X3.04-1986  7-bit ASCII Code for Information Interchange
ANSI X3.26       Punched Card Standard (ref. for IBM ASCII-EBCDIC translation)
ANSI X3.41       7-bit ASCII character extensions, corresponds to ISO 2022
ANSI X3.134.2    (proposed) 8-bit ASCII, corresponds to ISO 8859-1
GOST 19767-74--GOST 19769-74, GOST 13052-74
                 Main\ v\yislitel'n\e, sistem\ obrabotki i apparatura peredayi dann\h
                 (to be withdrawn, and replaced by a new version)
CAS GB 2312-80   Coded Chinese graphic character set for
                 information interchange
JIS C 6226-1983  Japanese graphic character set for
                 information interchange

Some literature

C. E. Mackenzie, Coded Character sets, History and Development, 1980
Joan M. Smith, Transmitting Text, Ass. for Lit. and Ling. Computing,
    Bulletin, Vol. 11, no. 2, 1983

25-Mar-88 02:53:09-EST,2918;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 25 Mar 88 02:53:01-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 25 Mar 88 02:53:18 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6884; Fri, 25 Mar 88 02:53:16 EDT
Received: by BITNIC (Mailer X1.24) id 9708; Fri, 25 Mar 88 02:48:06 EDT
Date: Fri, 25 Mar 88 08:37:11 +0100
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: "Alain FONTAINE (Postmaster - NAD)"
Subject: Re: ISO (ASCII) to EBCDIC Standards
To: Frank da Cruz

I've followed the discussion, and tried to keep up with all remarks and corrections... As a result, I've produced the two following tables, which are at least complete and coherent. This still does not mean that they are completely right. Any help would be appreciated. /AF

P.S. they are in REXX syntax because I used REXX to check the consistency..
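The kind of consistency check mentioned in the P.S. (every value present exactly once, and the two tables inverse to each other) can be sketched as follows. This is a Python illustration, not the REXX the author used; the function names are hypothetical and the demonstration tables are synthetic.

```python
# Sketch only: a Python illustration of the consistency check, not the
# author's REXX. Function names are hypothetical; demo data is synthetic.
from collections import Counter

def duplicates_and_missing(table: bytes):
    """Return (values present more than once, values never present)."""
    if len(table) != 256:
        raise ValueError("a translate table needs exactly 256 entries")
    counts = Counter(table)
    dups = sorted(v for v, n in counts.items() if n > 1)
    missing = sorted(set(range(256)) - set(counts))
    return dups, missing

def are_inverse(forward: bytes, backward: bytes) -> bool:
    """True if translating with forward and then backward is the identity."""
    return all(backward[forward[i]] == i for i in range(256))

# Synthetic demo: a table that swaps 0x5E and 0x5F, then a copy
# with one entry damaged.
good = bytearray(range(256))
good[0x5E], good[0x5F] = 0x5F, 0x5E
good = bytes(good)

bad = bytearray(good)
bad[0x6D] = 0x4F          # 0x4F now appears twice, 0x6D is lost
bad = bytes(bad)

print(duplicates_and_missing(good))
print(duplicates_and_missing(bad))
print(are_inverse(good, good), are_inverse(bad, good))
```

A table written as a REXX hex literal can be loaded with `bytes.fromhex(...)` on the concatenated hex digits before being passed to these functions.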
/* EBCDIC -> ASCII */
asc8859 = '000102039C09867F978D8E0B0C0D0E0F'x||,
          '101112139D8508871819928F1C1D1E1F'x||,
          '80818283840A171B88898A8B8C050607'x||,
          '909116939495960498999A9B14159E1A'x||,
          '20A0E2E4E0E1E3E5E7F1A22E3C282B7C'x||,
          '26E9EAEBE8EDEEEFECDF21242A293BAC'x||,
          '2D2FC2C4C0C1C3C5C7D1A62C255F3E3F'x||,
          'F8C9CACBC8CDCECFCC603A2340273D22'x||,
          'D8616263646566676869ABBBF0FDFEB1'x||,
          'B06A6B6C6D6E6F707172AABAE6B8C6A4'x||,
          'B57E737475767778797AA1BFD0DDDEAE'x||,
          '5EA3A5B7A9A7B6BCBDBE5B5DAFA8B4D7'x||,
          '7B414243444546474849ADF4F6F2F3F5'x||,
          '7D4A4B4C4D4E4F505152B9FBFCF9FAFF'x||,
          '5CF7535455565758595AB2D4D6D2D3D5'x||,
          '30313233343536373839B3DBDCD9DA9F'x

/* ASCII -> EBCDIC */
ebc8859 = '00010203372D2E2F1605250B0C0D0E0F'x||,
          '101112133C3D322618193F271C1D1E1F'x||,
          '405A7F7B5B6C507D4D5D5C4E6B604B61'x||,
          'F0F1F2F3F4F5F6F7F8F97A5E4C7E6E6F'x||,
          '7CC1C2C3C4C5C6C7C8C9D1D2D3D4D5D6'x||,
          'D7D8D9E2E3E4E5E6E7E8E9BAE0BBB06D'x||,
          '79818283848586878889919293949596'x||,
          '979899A2A3A4A5A6A7A8A9C04FD0A107'x||,
          '202122232415061728292A2B2C090A1B'x||,
          '30311A333435360838393A3B04143EFF'x||,
          '41AA4AB19FB26AB5BDB49A8A5FCAAFBC'x||,
          '908FEAFABEA0B6B39DDA9B8BB7B8B9AB'x||,
          '6465626663679E687471727378757677'x||,
          'AC69EDEEEBEFECBF80FDFEFBFCADAE59'x||,
          '4445424643479C485451525358555657'x||,
          '8C49CDCECBCFCCE170DDDEDBDC8D8EDF'x

25-Mar-88 05:16:23-EST,7033;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 25 Mar 88 05:16:15-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 25 Mar 88 05:16:30 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6911; Fri, 25 Mar 88 05:16:29 EDT
Received: by BITNIC (Mailer X1.24) id 0258; Fri, 25 Mar 88 05:10:17 EDT
Date: Fri, 25 Mar 88 10:52:20 +0100
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Andre PIRARD
Subject: IBM
  official translate tables
To: Frank da Cruz

I've obtained from IBM the following translate tables, said to be official. They apply to CECP 500 vs IBM PC cp850 or cp437. I may ask for CECP 037 and ISO8859 as well if anyone is interested. I'll "batch the orders" and post the answer to the list. Any comments?

-------------------------------------------------------------------------
FROM: INTERNATL 697/500 TO: PC 980/850
---------------------------------------------------------------
    -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
---------------------------------------------------------------
0-  00 01 02 03 DC 09 C3 9F CA B2 D5 0B 0C 0D 0E 0F
1-  10 11 12 13 DB DA 08 C1 18 19 C8 F2 1C 1D 1E 1F
2-  C4 B3 C0 D9 BF 0A 17 1B B4 C2 C5 B0 B1 05 06 07
3-  CD BA 16 BC BB C9 CC 04 B9 CB CE DF 14 15 FE 1A
4-  20 FF 83 84 85 A0 C6 86 87 A4 5B 2E 3C 28 2B 21
5-  26 82 88 89 8A A1 8C 8B 8D E1 5D 24 2A 29 3B 5E
6-  2D 2F B6 8E B7 B5 C7 8F 80 A5 DD 2C 25 5F 3E 3F
7-  9B 90 D2 D3 D4 D6 D7 D8 DE 60 3A 23 40 27 3D 22
8-  9D 61 62 63 64 65 66 67 68 69 AE AF D0 EC E7 F1
9-  F8 6A 6B 6C 6D 6E 6F 70 71 72 A6 A7 91 F7 92 CF
A-  E6 7E 73 74 75 76 77 78 79 7A AD A8 D1 ED E8 A9
B-  BD 9C BE FA B8 F5 F4 AC AB F3 AA 7C EE F9 EF 9E
C-  7B 41 42 43 44 45 46 47 48 49 F0 93 94 95 A2 E4
D-  7D 4A 4B 4C 4D 4E 4F 50 51 52 FB 96 81 97 A3 98
E-  5C F6 53 54 55 56 57 58 59 5A FD E2 99 E3 E0 E5
F-  30 31 32 33 34 35 36 37 38 39 FC EA 9A EB E9 7F
---------------------------------------------------------------
FROM: PC 980/850 TO: INTERNATL 697/500
---------------------------------------------------------------
    -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
---------------------------------------------------------------
0-  00 01 02 03 37 2D 2E 2F 16 05 25 0B 0C 0D 0E 0F
1-  10 11 12 13 3C 3D 32 26 18 19 3F 27 1C 1D 1E 1F
2-  40 4F 7F 7B 5B 6C 50 7D 4D 5D 5C 4E 6B 60 4B 61
3-  F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 7A 5E 4C 7E 6E 6F
4-  7C C1 C2 C3 C4 C5 C6 C7 C8 C9 D1 D2 D3 D4 D5 D6
5-  D7 D8 D9 E2 E3 E4 E5 E6 E7 E8 E9 4A E0 5A 5F 6D
6-  79 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96
7-  97 98 99 A2 A3 A4 A5 A6 A7 A8 A9 C0 BB D0 A1 FF
8-  68 DC 51 42 43 44 47 48 52 53 54 57 56 58 63 67
9-  71 9C 9E CB CC CD DB DD DF EC FC 70 B1 80 BF 07
A-  45 55 CE DE 49 69 9A 9B AB AF BA B8 B7 AA 8A 8B
B-  2B 2C 09 21 28 65 62 64 B4 38 31 34 33 B0 B2 24
C-  22 17 29 06 20 2A 46 66 1A 35 08 39 36 30 3A 9F
D-  8C AC 72 73 74 0A 75 76 77 23 15 14 04 6A 78 3B
E-  EE 59 EB ED CF EF A0 8E AE FE FB FD 8D AD BC BE
F-  CA 8F 1B B9 B6 B5 E1 9D 90 BD B3 DA FA EA 3E 41
---------------------------------------------------------------
FROM: INTERNATL 697/500 TO: PC 919/437
---------------------------------------------------------------
    -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
---------------------------------------------------------------
0-  00 01 02 03 DC 09 C3 9F CA B2 D5 0B 0C 0D 0E 0F
1-  10 11 12 13 DB DA 08 C1 18 19 C8 F2 1C 1D 1E 1F
2-  C4 B3 C0 D9 BF 0A 17 1B B4 C2 C5 B0 B1 05 06 07
3-  CD BA 16 BC BB C9 CC 04 B9 CB CE DF F4 F5 FE 1A
4-  20 FF 83 84 85 A0 C6 86 87 A4 5B 2E 3C 28 2B 21
5-  26 82 88 89 8A A1 8C 8B 8D E1 5D 24 2A 29 3B 5E
6-  2D 2F B6 8E B7 B5 C7 8F 80 A5 DD 2C 25 5F 3E 3F
7-  BD 90 D2 D3 D4 D6 D7 D8 DE 60 3A 23 40 27 3D 22
8-  BE 61 62 63 64 65 66 67 68 69 AE AF D0 EC E7 F1
9-  F8 6A 6B 6C 6D 6E 6F 70 71 72 A6 A7 91 F7 92 CF
A-  E6 7E 73 74 75 76 77 78 79 7A AD A8 D1 ED E8 A9
B-  9B 9C 9D FA B8 15 14 AC AB F3 AA 7C EE F9 EF 9E
C-  7B 41 42 43 44 45 46 47 48 49 F0 93 94 95 A2 E4
D-  7D 4A 4B 4C 4D 4E 4F 50 51 52 FB 96 81 97 A3 98
E-  5C F6 53 54 55 56 57 58 59 5A FD E2 99 E3 E0 E5
F-  30 31 32 33 34 35 36 37 38 39 FC EA 9A EB E9 7F
---------------------------------------------------------------
FROM: PC 919/437 TO: INTERNATL 697/500
---------------------------------------------------------------
    -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
---------------------------------------------------------------
0-  00 01 02 03 37 2D 2E 2F 16 05 25 0B 0C 0D 0E 0F
1-  10 11 12 13 B6 B5 32 26 18 19 3F 27 1C 1D 1E 1F
2-  40 4F 7F 7B 5B 6C 50 7D 4D 5D 5C 4E 6B 60 4B 61
3-  F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 7A 5E 4C 7E 6E 6F
4-  7C C1 C2 C3 C4 C5 C6 C7 C8 C9 D1 D2 D3 D4 D5 D6
5-  D7 D8 D9 E2 E3 E4 E5 E6 E7 E8 E9 4A E0 5A 5F 6D
6-  79 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96
7-  97 98 99 A2 A3 A4 A5 A6 A7 A8 A9 C0 BB D0 A1 FF
8-  68 DC 51 42 43 44 47 48 52 53 54 57 56 58 63 67
9-  71 9C 9E CB CC CD DB DD DF EC FC B0 B1 B2 BF 07
A-  45 55 CE DE 49 69 9A 9B AB AF BA B8 B7 AA 8A 8B
B-  2B 2C 09 21 28 65 62 64 B4 38 31 34 33 70 80 24
C-  22 17 29 06 20 2A 46 66 1A 35 08 39 36 30 3A 9F
D-  8C AC 72 73 74 0A 75 76 77 23 15 14 04 6A 78 3B
E-  EE 59 EB ED CF EF A0 8E AE FE FB FD 8D AD BC BE
F-  CA 8F 1B B9 3C 3D E1 9D 90 BD B3 DA FA EA 3E 41
---------------------------------------------------------------

25-Mar-88 06:44:36-EST,5111;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 25 Mar 88 06:44:26-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 25 Mar 88 06:44:39 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6929; Fri, 25 Mar 88 06:44:37 EDT
Received: by BITNIC (Mailer X1.24) id 0723; Fri, 25 Mar 88 06:39:10 EDT
Date: Fri, 25 Mar 88 12:27:00 MET
Reply-To: Discussion list for ASCII/EBCDIC character set related issues
Sender: Discussion list for ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: cp37/500
To: Frank da Cruz

Dear List Subscribers

It took me some time to compare Mr. Gilbert's conversion table with others in use. It is simply not true that there was no "official" translate table before CP37 and CP500 turned up. There is one in VS FORTRAN Language and Library Reference, SC26-4119-1, Appendix C, p. 365-370. There is even a government standard exactly identical to this, but it is not a US one: it is found in GOST 19768 of the USSR, issued in 1974. This is the thing I use as the most authoritative reference.
The combination of this table with ISO 8859-1 produces a unique code page, which I implemented using IEBIMAGE on our STC/Siemens laser printer (working in IBM 3800 compatibility mode), based on DOTR. I did the same thing with ISO 8859-2 for Eastern European languages. The only concession to present practice was the exchange of "logical not" with "circumflex", and a shift between right square bracket, exclamation sign, and vertical bar.

I see no reason to invent a new table for ISO 80-FF, creating further confusion. It could even involve changing the VS FORTRAN compiler. CP37 and CP500 ought to be withdrawn.

Yours faithfully, Johan van Wingen

This is the table (in ISO format, IBM mirrors this sometimes):

CONVERSION FROM ASCII TO EBCDIC

    0. 1. 2. 3. 4. 5. 6. 7. 8. 9. A. B. C. D. E. F.
.0  00 10 40 F0 7C D7 79 97 20 30 41 58 76 9F B8 DC
.1  01 11 4F F1 C1 D8 81 98 21 31 42 59 77 A0 B9 DD
.2  02 12 7F F2 C2 D9 82 99 22 1A 43 62 78 AA BA DE
.3  03 13 7B F3 C3 E2 83 A2 23 33 44 63 80 AB BB DF
.4  37 3C 5B F4 C4 E3 84 A3 24 34 45 64 8A AC BC EA
.5  2D 3D 6C F5 C5 E4 85 A4 15 35 46 65 8B AD BD EB
.6  2E 32 50 F6 C6 E5 86 A5 06 36 47 66 8C AE BE EC
.7  2F 26 7D F7 C7 E6 87 A6 17 08 48 67 8D AF BF ED
.8  16 18 4D F8 C8 E7 88 A7 28 38 49 68 8E B0 CA EE
.9  05 19 5D F9 C9 E8 89 A8 29 39 51 69 8F B1 CB EF
.A  25 3F 5C 7A D1 E9 91 A9 2A 3A 52 70 90 B2 CC FA
.B  0B 27 4E 5E D2 4A 92 C0 2B 3B 53 71 9A B3 CD FB
.C  0C 1C 6B 4C D3 E0 93 6A 2C 04 54 72 9B B4 CE FC
.D  0D 1D 60 7E D4 5A 94 D0 09 14 55 73 9C B5 CF FD
.E  0E 1E 4B 6E D5 5F 95 A1 0A 3E 56 74 9D B6 DA FE
.F  0F 1F 61 6F D6 6D 96 07 1B E1 57 75 9E B7 DB FF

CONVERSION FROM EBCDIC TO ASCII

    0. 1. 2. 3. 4. 5. 6. 7. 8. 9. A. B. C. D. E. F.
.0  00 10 80 90 20 26 2D BA C3 CA D1 D8 7B 7D 5C 30
.1  01 11 81 91 A0 A9 2F BB 61 6A 7E D9 41 4A 9F 31
.2  02 12 82 16 A1 AA B2 BC 62 6B 73 DA 42 4B 53 32
.3  03 13 83 93 A2 AB B3 BD 63 6C 74 DB 43 4C 54 33
.4  9C 9D 84 94 A3 AC B4 BE 64 6D 75 DC 44 4D 55 34
.5  09 85 0A 95 A4 AD B5 BF 65 6E 76 DD 45 4E 56 35
.6  86 08 17 96 A5 AE B6 C0 66 6F 77 DE 46 4F 57 36
.7  7F 87 1B 04 A6 AF B7 C1 67 70 78 DF 47 50 58 37
.8  97 18 88 98 A7 B0 B8 C2 68 71 79 E0 48 51 59 38
.9  8D 19 89 99 A8 B1 B9 60 69 72 7A E1 49 52 5A 39
.A  8E 92 8A 9A 5B 5D 7C 3A C4 CB D2 E2 E8 EE F4 FA
.B  0B 8F 8B 9B 2E 24 2C 23 C5 CC D3 E3 E9 EF F5 FB
.C  0C 1C 8C 14 3C 2A 25 40 C6 CD D4 E4 EA F0 F6 FC
.D  0D 1D 05 15 28 29 5F 27 C7 CE D5 E5 EB F1 F7 FD
.E  0E 1E 06 9E 2B 3B 3E 3D C8 CF D6 E6 EC F2 F8 FE
.F  0F 1F 07 1A 21 5E 3F 22 C9 D0 D7 E7 ED F3 F9 FF

DEVIATIONS:
                ASCII TO EBCDIC        EBCDIC TO ASCII                UNPRINTABLE
            |   21 5D 5E 09 0A 1C FF   4F 5A 5F 15 17 22 24 35 E1 FF  |
STANDARD        4F 5A 5F 05 25 1C 00   21 5D 5E 85 87 82 84 95 9F FF
PDP-HASP        5A 5F 4F 05 25 22 07   5E 21 5D 00 00 1C 00 1E 00 7F     00
VAX-SNA         4F 5A 5F 40 25 1C 3F   21 5D 5E 5C 5C 5C 5C 5C 5C 5C     5C
VAX SUBR        4F 5A 5F 05 25 1C FF   21 5D 5E 0A 1B 5C 5C 5C 5C FF     5C
VTAM            4F 5A 5F 05 15 1C DELETED 21 5D 5E 0A 00 5C 00 5C 00 7F  00
TSO-KERMIT      4F 5A 5F 05 25 1C 00   21 5D 5E 0A 1B 00 00 00 00 00     00
PC-3278     AD  5A 4F 5F 05 25 1C 00   5D 21 5E 85 87 82 84 95 9F FF

EARN/BITNET  A  21 5B 5D 7C 85 8A D5 E3 E5 FC    E  15 2A 4A 4F 5A 6A AD BB BD FC
             E  5A AD BD 4F 2A 15 BB 4A FC 6A    A  8A 85 E3 7C 21 FC 5B D5 5D E5

FROM J. W. van Wingen MOSGLA@HLERUL2 :
Mail to : P. O.
Box 486, 2300AL Leiden, Netherlands : 25-Mar-88 07:31:19-EST,2977;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 25 Mar 88 07:31:13-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 25 Mar 88 07:31:28 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6971; Fri, 25 Mar 88 07:31:27 EDT Received: by BITNIC (Mailer X1.24) id 0909; Fri, 25 Mar 88 07:26:10 EDT Date: Fri, 25 Mar 88 11:12:03 +0100 Reply-To: Discussion list for ASCII/EBCDIC character set related issues Sender: Discussion list for ASCII/EBCDIC character set related issues From: Andre PIRARD Subject: Re: ASCII/ISO/which EBCDIC? summary To: Frank da Cruz In-Reply-To: Message of Wed, 23 Mar 88 20:10:42 EST from >>My experience shows that BITNET is working perfectly as it >>stands. Are we going to let a chance messing up all that? > >I agree that this discussion be moved to the other list, but before >I do I can't help but point out that the above statement that BITNET >is "working perfectly" is one of the silliest things I have heard in >a long time, and it is a shame because this was an otherwise >fairly reasonable note. These words ask for a public reply. *From context*, the statement applies to ASCII/EBCDIC 7-bit codes translation of mail (through gateways or retrieving stored data obtained through them) and to receiving the same codes entered at EBCDIC terminals. *My experience* shows that, for example, we've never had any problem sending or receiving UUENCODEd or BOOed binary data, a good test because it uses every possible ASCII code in a message. And that this translation matched everything I could get my hands on. This is what threatens extension to 8 bits. This experience might be limited to a subset of BITNET or of its use however. This is why I have first queried the net to make up my mind. 
All I could hear of was some "sometime somewhere somebody...". I would have liked to evaluate that numerically by sending a simple form to be filled by a random sample of BITNET sites. But I have no time to do this and the questions to ask had to wait for some discussion first. Maybe after a while of ISO8859 good thinking, someone could undertake the project... That parity, uselessly reducing transmissions to 7 bits, is nonsense, that it is a pity we have to use mail to send binary data, and that other things could be better are all subjects I agree with but that were not the point of my note. But that the guy next door is suddenly typing hieroglyphs for brackets because CECP 500 has fallen upon him and that multiplying 3 EBCDIC by 3 ASCII codes sets gives us 9 translation tables pairs to choose from in the best case, *that* is really silly. 29-Mar-88 09:40:33-EST,5895;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 29 Mar 88 09:40:21-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 29 Mar 88 09:40:57 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4009; Tue, 29 Mar 88 09:40:55 EDT Received: by BITNIC (Mailer X1.24) id 4916; Tue, 29 Mar 88 09:38:49 EDT Date: Tue, 29 Mar 88 15:30:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: Accented Letters To: Frank da Cruz Dear List Subscribers The discussion of character codes shows that problems can be classified: 1. Problems with differing conversion tables EBCDIC/ISO8859. 2. Problems with available characters. As for 1, there is the traditional table, as found in the FORTRAN manual. BITNET table deviates from this at a few places in transferring codes in ASCII 00-7F (left part). For details see my previous letter. 
Then there is the table based on CP37/500 with a complete different right part (80-FF). As for 2, there is the national character problem, which can only be solved using the character sets of ISO 8859. Both issues should not be confused with each other. Distributing the ISO 8859 characters over a code page cannot be done in an arbitrary way. As soon as you choose the conversion table the result is fixed, and conversely, every code page created fixes its conversion table. So it is up to your choice to determine what is convenient. From the information I received I tried to reconstruct the CP500 code page, SH35-0053 not being available here. Then I compared it with the FORTRAN code page, as derived from the FORTRAN conversion table. Which do you prefer? Are the differences really worth the confusion? Yours faithfully, Johan van Wingen A COMPARISON OF FACILITIES FOR LETTERS WITH DIACRITICS Notation (descriptions taken from ISO 6937-2, additions between parentheses) / acute accent \ grave accent ^ circumflex accent % diaeresis (umlaut, trema) ~ tilde * caron (hachek) # breve (Rumanian a) # double acute accent (Hungarian o,u) @ ring (above: a,u) @ dot (above: z) = macron (upper line) $ cedilla (c,s,t) $ ogonek (Polish a,e) $ (barred: o, eth, thorn) _ (underline, fraction) & (ligature: ae,oe,sz) ? (dot under) REPRESENTATION OF LETTERS FROM ISO 8859-1 WITH FORTRANTABLE 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. .0 ~A ^E ~N $O 0 .1 a j \U A J 1 .2 b k s /U B K S 2 .3 c l t ^U C L T 3 .4 d m u %U D M U 4 .5 e n v /Y E N V 5 .6 \A f o w $P F O W 6 .7 /A g p x &s G P X 7 .8 ^A h q y \a H Q Y 8 .9 i r z /a I R Z 9 .A %A %E \O ^a \e ^i ^o /u .B @A \I /O ~a /e %i ~o ^u .C &A /I ^O %a ^e $d %o %u .D $C ^I ~O @a %e ~n /y .E /E %I %O &a \i \o $o $p .F \E $D $c /i /o \u %y REPRESENTATION OF LETTERS FROM ISO 8859-1 WITH CP500 TABLE 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. 
.0 $o $O .1 /e /E a j A J 1 .2 ^a ^e ^A ^E b k s B K S 2 .3 %a %e %A %E c l t C L T 3 .4 \a \e \A \E d m u D M U 4 .5 /a /i /A /I e n v E N V 5 .6 ~a ^i ~A ^I f o w F O W 6 .7 @a %i @A %I g p x G P X 7 .8 $c \i $C \I h q y H Q Y 8 .9 ~n &s ~N i r z I R Z 9 .A .B ^o ^u ^O ^U .C $d &a $D %o %u %O %U .D /y /Y \o \u \O \U .E $p &A $P /o /u /O /U .F ~o %y ~O REPRESENTATION OF LETTERS FROM ISO 8859-2 WITH FORTRAN TABLE 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. .0 $s #A $E /N *R 0 .1 *S *t a j @U A J 1 .2 $A $S /z b k s /U B K S 2 .3 *T $l c l t #U C L T 3 .4 $L *z d m u %U D M U 4 .5 *l @z e n v /Y E N V 5 .6 *L *Z /s /R f o w $T F O W 6 .7 /S @Z /A g p x &s G P X 7 .8 *s ^A h q y /r H Q Y 8 .9 $a i r z /a I R Z 9 .A %A %E *N ^a *c ^i ^o /u .B /L *E /O #a /e *d #o #u .C /C /I ^O %a $e $d %o %u .D $C ^I #O /l %e /n /y .E /E *D %O /c *e *n *r $t .F /Z *C $D $c /i /o @u REPRESENTATION OF LETTERS FROM ISO 8859-2 WITH CP500 TABLE 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. .0 *r *R *l .1 /e /E a j $L A J 1 .2 ^a $e ^A $E b k s *L B K S 2 .3 %a %e %A %E c l t C L T 3 .4 /r *c /R *C d m u *S D M U 4 .5 /a /i /A /I e n v E N V 5 .6 #a ^i #A ^I f o w /s F O W 6 .7 /l *d /L *D g p x /z G P X 7 .8 $c *e $C *E h q y H Q Y 8 .9 /n &s /N i r z *z I R Z 9 .A /S *T $S $A *s $l .B *t $s @z ^o #u ^O #U .C $d /c $D @Z %o %u %O %U .D /y /Y *n @u *N @U .E $T /C $T /o /u /O /U .F /Z $a *Z #o %y #O FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands : 29-Mar-88 12:30:03-EST,3713;000000000001 Mail-From: SY.FDC created at 29-Mar-88 12:29:58 Date: Tue 29 Mar 88 12:29:58-EST From: Frank da Cruz Subject: For the digest... 
To: sy.christine@CU20B.COLUMBIA.EDU Message-ID: <12386232998.151.SY.FDC@CU20B.COLUMBIA.EDU> Date: Tue, 29 Mar 88 17:54:11 +0200 From: Andre PIRARD Subject: Proposed Kermit Rule for Extended ASCII Keywords: ASCII, Extended ASCII, ISO8859, Translation Tables In the process of implementing extended (national) characters transfer between micros and IBM mainframes, I came to the conclusion that, for the sole IBM PC, I had to build at least 9 different tables in order to support 3 EBCDIC tables (traditional and CECP 500 and 037) x 3 "ASCII" tables (table 437, table 850 and ISO 8859/1). Not considering ISO for the Macintosh, I've still got 3 tables to build for the IBM host and, if I endeavoured Mac to IBM PC conversion, 3 more tables or so. When we add more machine types, it all looks like the wheat grains on a chessboard problem. Not counting the added difficulty of knowing which is to translate what in what. Doesn't it look reasonable that each party deal with its own code problems and that the Kermit protocol rule what character code standard travels on the line as it already does for restricted ASCII? (That applies for text mode only, of course). I think ISO8859/1 is there for the purpose, with the added bonus that it keeps the 80-9F range free (but available for additionals if needed). This range is indeed the one that adds the largest overhead to 8th bit quoting. Similarly, ISO8859/1 should be used for terminal mode communication, at least as an option. This just involves byte to byte conversion in 8-bit wide mode and an additional SO/SI escaping (ISO 2022) mechanism in 7-bit mode. The same applies to non-Latin group users who should use their own 8859/x version similarly. [Ed. - Kermit was designed (in 1981) on the assumption that 7-bit ASCII was the most common representation for text files. In ISO terms, 7-bit ASCII (with control-character prefixing, etc) is the presentation-layer "transfer syntax" for text files. 
But now we have a proliferation of 8-bit ASCII character sets -- in addition to the IBM PC's, Apple's, and DEC's various incompatible extended ASCIIs, we have the ISO 8859 variations, and then the various translations between them and EBCDIC.

In Japan, they face a similar problem. There are numerous character sets -- Katakana, Hiragana, Romaji, Kanji -- and there are numerous "standards" for representing each of these (especially Kanji) in the computer. Their solution was to modify the Kermit programs they use to "SET FILE TYPE TEXT ", putting the onus on the user to specify not only the file type but also the encoding.

As Andre suggests, it would be best if there were one single transfer syntax for text files (at least for languages whose alphabets can be represented in 8-bit characters), and each Kermit program translated between that and its own local code set. Is ISO 8859/1-1987 ("Latin Alphabet 1", = ANSI X3.134.2, = ECMA 94) a choice that won't offend anyone? The lower half (characters 0-127) corresponds to US ASCII (ANSI X3.4). If this proposal results in controversy, then does anyone have a simple alternative proposal?

Meanwhile, it seems wise to build user-defined translation tables into Kermit programs, such as we have in MS-Kermit 2.30, and IBM mainframe Kermit 4.0. In MS-Kermit, it might also be desirable to extend the translation mechanism to file transfer, in some general, user-controllable way. Opinions?]
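
[Note added in editing: the single-transfer-syntax idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Kermit code: Python's bundled codecs (cp437, cp850, latin-1, cp037, cp500) stand in for the translation tables, the "traditional" EBCDIC table is left out because Python does not ship it, and the names to_wire and from_wire are hypothetical.]

```python
# Sketch only: Python's bundled codecs stand in for the translation tables
# discussed above. to_wire/from_wire are hypothetical names, not Kermit's.

WIRE = "latin-1"  # ISO 8859-1 as the single code on the line

def to_wire(local_bytes: bytes, local_code: str) -> bytes:
    """Sender: translate from its own local code to the wire code."""
    return local_bytes.decode(local_code).encode(WIRE)

def from_wire(wire_bytes: bytes, local_code: str) -> bytes:
    """Receiver: translate from the wire code to its own local code."""
    return wire_bytes.decode(WIRE).encode(local_code)

micro_codes = ["cp437", "cp850", "latin-1"]  # micro-side codes named in the note
host_codes = ["cp037", "cp500"]              # the two CECP EBCDICs Python ships

text = "caf\u00e9 [x]"  # one accented letter plus the troublesome brackets

for micro in micro_codes:
    for host in host_codes:
        on_line = to_wire(text.encode(micro), micro)
        stored = from_wire(on_line, host)
        print(f"{micro:>8} -> {host}: {stored.hex(' ')}")

# Each party needs only its own table to and from the wire code.
print("direct pairs:", len(micro_codes) * len(host_codes))
print("tables via the wire code:", len(micro_codes) + len(host_codes))
```

With every party translating only between its own code and the code on the line, adding a new machine type adds one table pair instead of one pair per existing partner.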
------- 30-Mar-88 08:54:30-EST,1543;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 30 Mar 88 08:54:23-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 30 Mar 88 08:54:43 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5247; Wed, 30 Mar 88 08:54:41 EDT Received: by BITNIC (Mailer X1.24) id 4652; Wed, 30 Mar 88 08:53:46 EDT Date: Tue, 29 Mar 88 20:48:32 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Turning the Tables: A Standards Problem X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz Well, it may or may not be just history from the "old" ASCII Standard, but they are ALL that way. Every one. The current ASCII standard, the ANSI standards corresponding to ISO8859, the control code standards, and so on and so forth. And, yes, ISO has done "the same thing". And so has CCITT, where you will find character codes expressed as column/row. Perhaps it is really an artifact, not of "old" ASCII, but of "old" FORTRAN, which also addressed things in this order. In any event, better just get used to it; a "correction" would cause chaos. 
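
[Note added in editing: a minimal sketch of the column/row addressing mentioned above, assuming the usual convention in which the column is the high-order four bits, the row is the low-order four bits, and both are written in decimal. The function names are hypothetical.]

```python
def from_col_row(position: str) -> int:
    """Convert 'column/row' notation (decimal, e.g. '4/1') to a code value."""
    col, row = (int(part) for part in position.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must each be 0..15: {position!r}")
    return col * 16 + row

def to_col_row(code: int) -> str:
    """Convert an 8-bit code value to 'column/row' notation."""
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not an 8-bit code: {code!r}")
    return f"{code >> 4}/{code & 0x0F}"

print(hex(from_col_row("4/1")), chr(from_col_row("4/1")))  # 0x41 A
print(to_col_row(0xA0))                                    # 10/0
```

This is also why the conversion tables in this archive list the first hex digit across the top and the second digit down the side.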
31-Mar-88 10:15:54-EST,3967;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 31 Mar 88 10:15:51-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 31 Mar 88 10:16:32 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6913; Thu, 31 Mar 88 10:16:29 EDT
Received: by BITNIC (Mailer X1.24) id 9858; Thu, 31 Mar 88 10:15:22 EDT
Date: Thu, 31 Mar 88 16:58:17 GMT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Matthias Melcher <$28%DHDURZ1.BITNET@CUVMA.COLUMBIA.EDU>
Subject: Code Page Nationalities
To: Frank da Cruz

Dear list subscribers,

the comparison of code pages can be simplified when we just think of them as having a nationality or a "mother tongue", and some of them knowing foreign languages.

The mother tongue of a code page is determined by one half of its character repertoire, a kernel which could be
- entered on display terminals
- coded with 7 bits
- mapped onto the kernels of code pages of other nationalities simply by replacing the 14 "national use characters"

That is the left half of an ASCII code page, and in EBCDIC the areas around 4A-7F, 81-A9 and C1-F9.

The difference between CP 037 and CP 500 is not "data processing oriented" vs. "word processing oriented" (Ed Hart), but:
- CP 037 has US nationality
- CP 500 has nationality "International", like 3274 Interface Code 14, and ISO 8859 itself.

In that sense, "US"-ASCII must be regarded as International rather than US, and there is no real US ASCII code page (with e.g. Cent-sign in the left half).

In the days when code pages did not speak foreign languages, translations had to be done
- either ignoring the graphic representations (e.g. exclamation point <-> right bracket, circumflex <-> logical-not)
- or with foul compromises (e.g. taking brackets AD/BD from TN-chain, but not braces 7B/8B).
Today, if we want to respect the graphics we have the choice:

(a) Map International ASCII (= ISO 8859) to International EBCDIC (= CP 500), i.e. kernel onto kernel (mother tongue) and extension onto extension (foreign languages).

(b) Map International ASCII to national EBCDICs, e.g. US (= CP 037), thus intermixing kernel and extension.

We must be aware that choice (b) logically consists of two translations: ASCII to EBCDIC and International to US, and this brings a lot of conceptual complexity and confusion which, in the long run, make communication cumbersome.

Choice (a), on the other hand, bears many migration problems, especially as long as IBM has not completed its CECP support (like teaching PL/1 to recognize B0 as logical not, or teaching the 3174 to show and accept thorns). But I think this all will come. For example, the 3174 CECP RPQ 8Q0566 has already been shown at the CeBIT Hannover fair and will be released as soon as some software corequisites are done.

In the meantime, it's not too difficult to deal with Code Page 500. Even on a US 3278 you can edit most of the CECP characters, just by using CMS set output / set input with the old 5A-device codes. (We can send you a copy of the EXEC.)

I don't know which IBM representatives still recommend CP 037 to US users. The official recommendation explicitly states "Standardizing on a single code page for the entire network ..." "IBM recommends that if this is going to be done that the customer standardize using the International CECP code page." (8Q0566 announcement)

EARN/BITNET is an international network. So I think the code page has to be international as well, and every site must be able to send and accept mail in this code page.
Mit freundlichen Grüßen - Matthias Melcher

8-Apr-88 12:09:29-EST,2329;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 8 Apr 88 12:09:26-EST
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 08 Apr 88 12:07:40 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6032; Fri, 08 Apr 88 12:07:37 EDT
Received: by BITNIC (Mailer X1.25) id 6853; Fri, 08 Apr 88 12:06:10 EDT
Date: Fri, 8 Apr 88 11:38:12 EDT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: Re: Code Page Nationalities
To: Frank da Cruz
In-Reply-To: Your message of Thu, 31 Mar 88 16:58:17 GMT

The ISO 8859-1 character set is Latin Alphabet number 1 and has most characters needed for Western Europe. Eastern Europe uses a different version of ISO 8859.

In discussing the differences between Code Pages 500 and 37, please understand that they contain exactly the same set of characters as ISO 8859-1. They were designed that way. However, the code points for most of the characters are different - each is a different code. Code Page 37 is a 192 character superset of the US/Canada English Data Processing 96-character code except for (square) brackets. Similarly, the 96 character subset of Code Page 500 characters matches the ISO 646 / ANSI X3.4 characters.

When translating characters from ISO 8859-1 codes to Code Page 500, for example, I believe that the characters should match. If the translation were from ISO 8859-1 to Code Page 37, the translation would be different. However, if we were considering another variation, ISO 8859-2, then I would expect IBM to provide another code page with the same character set as ISO 8859-2. I would expect that the IBM Code Page corresponding to ISO 8859-2 might require a different translation than the one defined by ISO 8859-1 to Code Page 500.
If however, IBM would standardize on one code page for Latin Alphabet number 1, then the ISO 8859-x translation to IBM code page xxx could be held constant. That would be desirable. Ed Hart 8-Apr-88 23:15:10-EST,2676;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 8 Apr 88 23:15:04-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 08 Apr 88 23:13:14 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6853; Fri, 08 Apr 88 23:13:12 EDT Received: by BITNIC (Mailer X1.25) id 4371; Fri, 08 Apr 88 23:07:15 EDT Date: Fri, 8 Apr 88 12:10:00 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Barry D Gates Subject: Something I came across out in netnews-land... To: Frank da Cruz I thought the ISO8859 list might be interested in what this person has to say. I do not necessarily support the persons views, nor do I reject them (I'm not an expert on this, just an interested party). I'm interested in any opinions folks may have on this. I noticed from previous postings here that what we are generally referring to as ISO8859 is really ISO8859/1 for the Western European Nations. Does ISO intend to come out with a codepage for the languages this poster lists in his mailfile? Also are there any other EBCDIC mappings for the other codepages? Anyway, here is the person's posting. It was brought on in response to a posting on "interNational Language Support" on the HP computers I believe. The NLS has nothing to do with the SP5 IBM oddity, but more with what we are talking about here. 
------ Forwarded MAIL from comp.std.internat: International Standards Newsgroup From: bas+@andrew.cmu.edu (Bruce Sherwood) Newsgroups: comp.std.internat Subject: Re: International Language Support Message-ID: <8WKYiky00UgCM600g4@andrew.cmu.edu> Date: 6 Apr 88 15:16:00 GMT Organization: Carnegie Mellon University Lines: 14 In-Reply-To: <691@kuling.UUCP> To repeat a major complaint I have about ISO 8859 (which I'm distressed to see is a component of NLS): This standard is based on nations rather than languages. So the West European version doesn't handle Welsh or Catalan or Esperanto (which don't have their own nations). The older standard, ISO 6937, was based on forty Latin-alphabet-using languages, not on nations. So it handled just about everything (except for Vietnamese) including Welsh and Catalan and Esperanto. ISO 8859 is a MAJOR step backward in terms of linguistic equality. Bruce Sherwood --------- End of forwarded mail. 9-Apr-88 16:19:02-EST,2716;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Sat 9 Apr 88 16:18:59-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Sat, 09 Apr 88 16:17:19 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7347; Sat, 09 Apr 88 16:17:18 EDT Received: by BITNIC (Mailer X1.25) id 8422; Sat, 09 Apr 88 16:16:23 EDT Date: Sat, 9 Apr 88 16:08:00 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Chris Tanner <01696%AECLCR.BITNET@CUVMA.COLUMBIA.EDU> Subject: ISO 6937 and ISO 8859 To: Frank da Cruz This is a few remarks about ISO 6937 and ISO 8859 in reply to a recent mailing decrying ISO 8859. I am not an expert in this field, but from the little I know, here goes. ISO 6937 and ISO 8859 were developed by 2 different groups within ISO-IEC JTC1/SC2 for different purposes. ISO 6937 is designed for printers. 
It creates accented characters by providing accent symbols which are actually non-spacing characters (these are found in the G2 set). For instance, e acute is created by the acute sign character plus e. It also includes the oe diphthong. This sort of thing is fine for printing but not very good for character string comparison and sorting.

ISO 8859 is designed for use in programs (character string comparison and sorting). It provides separate characters for all the accented characters. It does not provide the oe diphthong since this is treated in string comparisons as O + E. There are 8 parts to ISO 8859. If people are interested, I can post to the list the title of each part, and the languages covered.

SC2 has been asked by its member countries to achieve a harmonization between these 2 standards. This has resulted in a project proposal (Document JTC1 N 156) (balloting closes June 2, 1988) which is accompanied by a paper entitled Co-ordination of the Development of ISO 6937 and ISO 8859. It describes the purposes of ISO 6937, ISO 8859 and ISO 4873 (which specifies the rules and structure for 8 bit codes), the problems with these standards, and it proposes a structure for a family of graphic character sets for 8 bit coding. Hopefully this project will achieve its aim.

By the way, the library/information services people have a coding standard of their own which is similar to ISO 6937 in some ways.

Chris Tanner
Atomic Energy of Canada
My views are my own and not the views of my employer.
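
[Note added in editing: the comparison problem described above can be illustrated with a small sketch. It uses Unicode combining marks as a stand-in for ISO 6937 non-spacing accents, which is an assumption made for illustration only. In ISO 6937 the accent code comes before the letter, whereas a Unicode combining mark follows it, so this shows the principle rather than the actual byte layout.]

```python
import unicodedata

ACUTE = "\u0301"  # Unicode COMBINING ACUTE ACCENT, standing in for a non-spacing accent

composed = "\u00e9"     # e acute as one coded character (the ISO 8859 approach)
sequence = "e" + ACUTE  # letter plus separate accent code (two codes for one letter)

# The two spellings print the same but do not compare equal code for code.
print(composed == sequence)          # False
print(len(composed), len(sequence))  # 1 2

# A program must first fold the sequence into the single-code form.
folded = unicodedata.normalize("NFC", sequence)
print(folded == composed)            # True
```

A code with one value per accented letter lets a program compare and sort strings code by code, without a folding step of this kind.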
11-Apr-88 11:59:23-EST,1872;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 11 Apr 88 11:59:18-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 11 Apr 88 11:30:37 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8303; Mon, 11 Apr 88 11:30:35 EDT Received: by BITNIC (Mailer X1.25) id 8501; Mon, 11 Apr 88 11:29:52 EDT Date: Mon, 11 Apr 88 17:14:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: 6937/8859 To: Frank da Cruz Dear list subscribers Mr. Sherwood's remarks about ISO 8859 compared with ISO 6937 are unfair and untrue. In ISO 8859-3 one finds characters for Catalan, Esperanto, Galician, Maltese and Turkish. Lappish is in ISO8859-4. ECMA-94 contains all of ISO 8859-1,2,3,4 together. ISO 6937 is NOT a single byte coded character set. As a sequel to Mr. Tanner's comments, I attended the meeting of SC2/WG3 responsible for 6937 and 8859, 16-17 March 1988 in Paris. Work on 6937-5,6 will be discontinued, and DIS 6937-7,8 withdrawn. There will be ISO 8859-9, Latin alphabet no. 5, with Icelandic eth, thorn and /y replaced by Turkish g breve, s cedilla and dotless i / dotted capital I. The first draft of ISO XYZ (the harmonization) will appear in May. ISO 5426, the bibliographic coded character set is not (yet) included in the harmonization. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands : 11-Apr-88 13:19:29-EST,1919;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 11 Apr 88 13:19:27-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 11 Apr 88 13:17:50 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8608; Mon, 11 Apr 88 13:17:49 EDT Received: by BITNIC (Mailer X1.25) id 0917; Mon, 11 Apr 88 13:17:09 EDT Date: Mon, 11 Apr 88 12:46:17 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Something I came across out in netnews-land... To: Frank da Cruz In-Reply-To: Your message of Fri, 8 Apr 88 12:10:00 EST The problem with ISO 6937 is that none of the computer manufacturers support it. ISO 6937 indeed has all of the accents and with it you can form many characters. ISO 6937 comes from CCITT and the standard is concerned with teletext transmission. However, the computer manufacturers found it unacceptable because multiple bytes were used to form and store the characters. For example, to form an "a" with a circumflex, you did something like: strike the accent, then backspace, then the character. The manufacturers wanted to represent each character with one code - not some with three. ISO 8859-1 is also incomplete. For example, it also lacks the French "oe" diphthong. ISO 8859-1 is a compromise standard. I understand that when the compromise was reached, and the chairman asked if any more changes should be made, no one said anything; because if one started, then everyone would have "just a little change" and we would still not have a standard. 
Ed Hart 11-Apr-88 23:57:10-EST,7258;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 11 Apr 88 23:57:07-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 11 Apr 88 23:55:32 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9570; Mon, 11 Apr 88 23:55:31 EDT Received: by BITNIC (Mailer X1.25) id 7817; Mon, 11 Apr 88 23:52:55 EDT Date: Mon, 11 Apr 88 10:56:34 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Single or Multiple Tables for Multiple Character Sets? To: Frank da Cruz If I understand his messages correctly, Johan van Wingen's main objection to the late discussions is that they are working on the assumption that equivalent graphics in different character sets should be mapped to each other. He objects that this would lead to multiple, incompatible translate tables being required. Edwin Hart voices the hope that IBM may be able to do for 8859 (in all its various parts) what they did for ISO 646 (in all its various national manifestations): set a translation or mapping by comparing two character sets (e.g. CP 37 or CP 500 and ISO 8859/1), and then define the various related EBCDIC code pages by applying that mapping to the other parts of 8859. (So a Greek EBCDIC code page will result from applying the CP500-ISO8859/1 translation to ISO8859/7, and so on.) That might not completely answer Mr. van Wingen's concerns, but it would be handy. I agree that it would be nice to keep the number of translate tables down, where feasible. But I fear that it's not feasible in the way Mr. Hart suggests, and I have a number of problems with Mr. van Wingen's idea that an arbitrary mapping should be defined, implemented *in hardware* (!), and stuck to come hell or high water. 
The fundamental question I have for anyone who will answer is: If we do not translate graphic-for-graphic, what is the point of translating? Why not define our mapping as the one-to-one mapping in which each hex code maps to itself? Or, to protect the control-code areas, why not just map

    ISO      EBCDIC
    0-1F     0-1F
    80-9F    20-3F
    20-7E    40-9E
    7F       FF
    A0-FE    A0-FE
    FF       9F

?

Obviously, this is *not* repeat *not* a serious suggestion. Why? because it would do no one any good at all. Similarly, applying the mapping given in the back of the VS Fortran 2.1 manuals to either ISO 8859/1 or any extended EBCDIC code page will give you code pages that contain all the necessary characters, but in an arrangement that no one at all supports. What good would data like that do anyone?

The obvious desiderata for translate tables seem to be:
- there should be as few as possible, preferably only one
- they should translate characters correctly (i.e. graphic for graphic, with substitutions only where required)
- they should preserve the collation sequence of the special characters or second alphabet in the code

Equally obvious, no two of these are compatible. The US CECP does not preserve the collating sequence of ISO 8859/1, the usual EBCDIC version of the library character set does not preserve the collating sequence of the ASCII version of the same set (why?! does anyone know why?), and the mappings from ALA/ASCII to ALA/EBCDIC are not compatible.

Similarly: ISO 8859/7 will define a Greek character set, and ISO 8859/8 a Hebrew character set. Without having seen either, I'll give ten to one odds against either set mapping to the EBCDIC character sets for Greek and Hebrew with the same translations as for 8859/1 and any IBM extended code page.

------------------------------------------------------------

What can be done? I don't know the answers, but it seems obvious that certain things *will* be done no matter what, and that others *can* be done if we here will agree to do them.
I offer the following observations as one person's tentative assessment of the facts, probabilities, and hopes. First: if IBM can be persuaded to define one single EBCDIC for Western Europe without country variations, as proposed (I think) in the SHARE paper, they should do so. Second: a good character-to-character translation for one of the IBM extended code pages (37, 500, or some CECP) to ISO 8859/1 is going to be implemented a lot of places, along the lines of Howard Gilbert's posting and the various revisions to it. There is no point in trying to stop this, but we should, as Mr. Gilbert says, all agree and implement the same mapping, not different ones. To implement the same mapping, we should decide, even if IBM will not, on one EBCDIC code page to take as basic or common. Third: at sites with library automation systems, a translate table for the library character sets will also be implemented, or has already been. There is no point in stopping this, either, and it can't be stopped anyway. The hardware for library terminals defines the required table very rigidly, and the library automation systems don't have the flexibility to adjust to divergent translations. (Notis, at least, does know the difference between terminals with the library character set and terminals without -- but it *knows* what the library character set is, and cannot readily be told different.) Fourth: although some protocol converters have the memory required for multiple translate tables (e.g. Series/1s), others (e.g. 7171s) don't. Those running 7171s may be able to fit one or even two alternate tables into their 7171s, but not more. And we can only fit one: the rest of the room is taken up by local terminal types. So we are going to have to choose: if you can only support ONE extended-character-set translate table, which is it going to be? Obviously, I think it should be the one we here agree on, if we here can agree on one. 
But for library machines, it's going to have to be the one defined by the library code pages. Fifth: how can we support the other required translations if we cannot put them into our protocol converters? Matthias Melcher has the best idea I've seen: we use the CMS SET INPUT and SET OUTPUT commands to simulate the translate tables actually needed by the users. I have only done a little work on this, but my experiments so far seem to show that it can work. SET INPUT and SET OUTPUT, on the other hand, only work for terminal support. For file transfer, we are going to have to have execs which will post-process files uploaded with a given translate table, and re-translate them into the proper code page. That shouldn't be too hard with Rexx under VM. What other operating environments can do, I don't know. All of which is just one person's private opinion. Please contradict me where I am wrong. Michael Sperberg-McQueen, University of Illinois at Chicago 12-Apr-88 00:56:38-EST,6367;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 00:56:33-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 00:54:57 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9635; Tue, 12 Apr 88 00:54:55 EDT Received: by BITNIC (Mailer X1.25) id 8067; Tue, 12 Apr 88 00:51:49 EDT Date: Mon, 11 Apr 88 23:31:39 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "Bryan, Jerry" Subject: Single or Multiple Tables for Multiple Character Sets? To: Frank da Cruz In-Reply-To: Message of 04/11/88 at 10:56:34 from U18189@UICVM I have refrained from saying anything to this list so far, mostly out of fear of saying something foolish. It remains to be seen how well founded that fear is, but I have finally decided to put my two cents worth in anyway. 
Speaking of which, I am writing this on a PC running a VT100 emulator going into an IBM system through a 7171 protocol converter, so I would have a hard time putting in a cent sign (ignoring going into XEDIT hex mode), and even if I did, many of you would have a hard time seeing the cent sign on your terminal anyway, which I suppose is the whole point of this list. (Note to Europeans about American slang, "two cents worth" is an opinion that may not be worth very much (or may -- it is up to the listener to decide). The speaker is specifically disclaiming any great profundity by using that phrase.) Anyway, I have the feeling that there is too much emphasis on EBCDIC-ASCII conversion and not enough emphasis on straightening out EBCDIC and straightening out ASCII *as separate problems*. The problems of EBCDIC are legion and well documented -- the characters needed by C and PASCAL, national characters, etc. ASCII has the same problems of missing characters and national characters, and is even worse than EBCDIC in some ways because it historically has been only a 7-bit code. I offer the following suggestions. 1. 7-bit ASCII is a lost cause. I realize there will be 7-bit ASCII well into the next century, but we would do well to concentrate on 8-bit ASCII and getting it right. One could argue that 8 bits are not enough, either, but 7 bits are hopeless. 2. Proper graphic-to-graphic mappings *within EBCDIC* and *within ASCII* are vital. To the maximum extent possible, "proper" means "it looks the same, no matter what". This goal really cannot be achieved in 8 bits, but it should be the goal, nevertheless. I have had the opportunity which many Americans do not have of living in Europe. It was irritating to receive messages from America and have characters not look right. For that matter, even locally written things -- SAS programs using dollar signs, for example -- looked awful. 
I lived in Norway, which to the eye of an American has a curious looking alphabet, but one adapts. However, I once traveled to Germany and received messages from Norway in Norwegian on a German terminal. My Norwegian is not all that good anyway, but it was doubly hard when many of the Norwegian characters were rendered as German characters. Now, I have the same problem in America with Norwegian I receive here. It seems to me that constancy of graphic rendering ought to be one of the highest, if not the highest goals, even though the goal cannot really be achieved in 8 bits if enough languages are considered. (For that matter, wouldn't it be nice to prepare something on a word processor with italics, send it out over BITNET, and have your italics characters appear as italics on the recipient's screen?) The infamous problem of the square brackets on the TN print train is one of many examples of inconstant graphic rendering in EBCDIC, but as noted on this list and in SHARE and SEAS papers, there are many others. 3. Having said all that, rather at too much length, then let me further suggest that the same constancy of graphic rendering ought to be a goal of ASCII-EBCDIC conversion. All the problems I mentioned in item 2 were EBCDIC-EBCDIC problems with communications between IBM VM/CMS systems. I also communicated between IBM and VAX systems in Norway and to the rest of BITNET, and the lack of graphic constancy gets even worse when ASCII is introduced into the equation. 4. Clearly, graphic constancy implies that both ASCII and EBCDIC be as rich as possible in characters, and that the national character idea is not a very good idea. An American dollar sign and a British pound sign, for example, need to be *standard* in ASCII and EBCDIC, but there are numerous other examples such as Western European umlauted, accented, and diphthonged characters. Last I heard, the backslash was a national character -- not so nice for the folks writing in C. 
However, all this still comes back to even 8 bits not being enough (Greek? Hebrew? Russian? Japanese? Arabic? etc.) 5. Finally, graphic constancy implies that ASCII-EBCDIC conversions be fully reversible in both directions. This is a part of my distaste for even dealing with 7-bit ASCII, where full conversion to/from EBCDIC is clearly impossible. I will finish by noting that I have tangled with this problem for years, and never cease to be amazed by how difficult it is. It always *seems* like it ought to be easy, but somehow it never is. Something always gets you. I have had users editing IBM PL/1 code on a VAX, submitting batch to an IBM machine, for example. How do you handle that? Etc. etc. etc., as many other people have pointed out. Be suspicious of anybody who doesn't understand why the problem is not trivial, and who submits his own set of translate tables to prove it. 12-Apr-88 01:38:46-EST,1519;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 01:38:41-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 01:37:04 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9645; Tue, 12 Apr 88 01:37:03 EDT Received: by BITNIC (Mailer X1.25) id 8504; Tue, 12 Apr 88 01:36:10 EDT Date: Mon, 11 Apr 88 22:27:00 PDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Leonard D Woren Subject: Two more cents worth... To: Frank da Cruz I'm a hardcore IBM bigot, but (with some help from coworkers), I've come to the realization that EBCDIC's design doesn't make sense. The alphabet isn't contiguous, and numerals sort higher than letters. ASCII makes certain programming tasks much simpler by not having either of these defects, which date back to EBCDIC's ancestry in BCD, which was based on punch card codes. 
This isn't intended to start a war of words, so no flames please... This is just mentioned as something to think about: It may be heresy, but maybe the answer is to add some characters to 8 bit ASCII and throw out EBCDIC. (Yes, I realize how much work a conversion would be.) 12-Apr-88 10:09:56-EST,1482;000000000000 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 10:09:52-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 10:08:15 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0194; Tue, 12 Apr 88 10:08:11 EDT Received: by BITNIC (Mailer X1.25) id 5511; Tue, 12 Apr 88 10:07:38 EDT Date: Tue, 12 Apr 88 09:11:02 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Two more cents worth... To: Frank da Cruz In-Reply-To: Your message of Mon, 11 Apr 88 22:27:00 PDT Throw out EBCDIC? That clearly is one of the options. However, I believe that this would be a larger conversion effort than to IBM Country Extended Code Pages like 37 v1 or 500 v1. Also, ISO 8859 does not have a contiguous alphabet because the accented characters are in the upper half of the table so you have not fixed one of the problems. As soon as you have to be concerned with accented characters and different collating sequences between countries, then sorting becomes much more difficult (because it depends on both language and country). 
Ed Hart 12-Apr-88 10:32:00-EST,4592;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 10:31:57-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 10:30:20 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0262; Tue, 12 Apr 88 10:30:18 EDT Received: by BITNIC (Mailer X1.25) id 6198; Tue, 12 Apr 88 10:23:54 EDT Date: Tue, 12 Apr 88 09:08:44 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Howard Gilbert Subject: Re: Single or Multiple Tables for Multiple Character Sets? To: Frank da Cruz In-Reply-To: Message of Mon, 11 Apr 88 10:56:34 CDT from It is important in this discussion to realize that character set issues are not separable from the purpose of the data or communications. Some people mix the question of ASCII data and ASCII terminals without proper thought. When an ASCII terminal is connected to a protocol converter to emulate a 3270 display, it is doing device emulation and not character translation. You press "A" and get an "A" on the screen and at the host. This is not a case of simple character translation of X'41' into X'C1'. If you are in APL mode, sending X'41' will be translated into APL alpha and lowercase "a" goes to uppercase. If "A" is embedded in an ESC sequence, it is interpreted and not translated. What, after all, is the "EBCDIC" meaning of PFK 4 (answer: it is an AID and not a character). Thus the objective of the 7171 is to translate the KEY marked "A" into EBCDIC and not the code. Turn on the Dvorak keyboard mode and see what happens then. This then follows into all of the remaining discussion. We recently were disturbed to note that Notis cannot find the city of Lodz in Poland. The problem is that the L is stroked (overtype L and /). Stroke L is an ALA special alphabetic (along with D bar, O /, and U hook). 
It has lowercase and uppercase forms. The problem is that ordinary library users do not have the special alphabetics and type in ordinary "L" and Notis does not alias approximately homographic alphabetic characters when doing a search of the database. Note that Notis will match on "odz" but not "Lodz". Worse, our database is inconsistent in its handling of AE and OE diphthongs. In a large number of cases they are typed as two characters rather than using the diphthong code. Again, database searching is a problem. There are around 500 characters (including all diacritically marked forms) enumerated in the ANSI Z39.47-1985 standard for 35 Roman languages and 51 other Romanized forms of languages. Unfortunately, this does not include the Hebrew, Cyrillic, and Arabic alphabets let alone the Far East. In its most general form, the problem cannot be solved. It can be solved FOR PARTICULAR APPLICATIONS. Not all applications will find the same solution optimal. The purpose of the committee is to find if there is one solution which is applicable to a large enough family of applications to warrant general acceptance. There are some who would argue that we are looking for a single common translation. I would prefer to believe that we are looking for a single preferred translation for the bulk of use. Just as ISO 8859 itself cannot replace ANSI Z39.47 and stay within the 8 bit limit of available graphic code points, so any ASCII to EBCDIC translation which is generally suitable for Data and Word Processing will still fail to address math-technical, traditional TN (box drawing), APL, and other common code problems. I have separately held that we need to make an accompanying recommendation that ASCII-EBCDIC (and EBCDIC-EBCDIC) be systematically addressed in operating systems, data management subsystems, and communications subsystems. This will allow organizations to develop standardized approaches to special needs which are not addressed by the common translate table. 
In essence, the idea of a single common table is too easy a way out for IBM. They have undertaken to define such a table on at least two previous occasions. Changing 512 bytes of table space is rather easy. Addressing the general question of codes, alphabets, collating sequences, and the like is a much larger and expensive project. 12-Apr-88 11:20:07-EST,3401;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 11:20:04-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 11:18:28 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0393; Tue, 12 Apr 88 11:18:26 EDT Received: by BITNIC (Mailer X1.25) id 8035; Tue, 12 Apr 88 11:17:35 EDT Date: Tue, 12 Apr 88 10:14:19 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "Bryan, Jerry" Subject: Two more cents worth... To: Frank da Cruz In-Reply-To: Message of 04/12/88 at 09:11:02 from HART@APLVM >Throw out EBCDIC? >That clearly is one of the options. However, I believe that this would be >a larger conversion effort than to IBM Country Extended Code Pages like >37 v1 or 500 v1. Also, ISO 8859 does not have a contiguous alphabet because >the accented characters are in the upper half of the table so you have not >fixed one of the problems. As soon as you have to be concerned with >accented characters and different collating sequences between countries, >then sorting becomes much more difficult (because it depends on both language >and country). Notwithstanding the problems listed above, I think that throwing out EBCDIC might ultimately be the way to go, and I, too, am a lifelong IBM bigot. Throwing out EBCDIC right now is clearly unthinkable. But consider the following. 
Suppose this process of rationalizing EBCDIC and ASCII succeeds to the point that there is a well defined graphic to graphic mapping and also a well defined and fully reversible 256 code point to 256 code point mapping established. At that point, aside from such trivial little problems as old data, old hardware, sorting and collating sequences, etc., isn't the mapping between code points and graphics somewhat arbitrary? And if the mapping is somewhat arbitrary, why not standardize on ASCII? (An irrelevant aside on the sorting problem: I do not know how the accented or umlauted characters are sorted in German or French, for example. But Norwegian has one curious sorting problem which I think no coding will solve completely. The 29-th letter in the Norwegian alphabet has two different graphics renderings. One is as a double "A" -- "aa" in lower case and "AA" in upper case. This is not a diphthong, it is a *single* letter rendered as two characters. The other graphics rendering is as an "A" with a circle over it. The "AA" must be sorted as if it were a single letter in the 29-th position of the alphabet, even though it is represented as two A's in computer memory. The "A-with- circle-over-it" is also sorted as if it were a single letter in the 29-th position of the alphabet, and it is represented as a single character in computer memory. But the two distinct graphics renderings must be maintained, so people's names will not only be spelled correctly (either graphics rendering is a "correct" spelling), but also will *look* right. Are there any other sorting problems which anybody knows about which are this severe?) 
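The Norwegian case above suggests keeping the stored spelling untouched and sorting on a separately derived key. What follows is only a minimal sketch of that idea in Python, not anything proposed on the list. The names `NORWEGIAN_ALPHABET`, `RANK`, and `norwegian_sort_key` are hypothetical, and the sketch assumes every "aa" pair stands for the 29th letter, which will be wrong for some foreign words and names.

```python
# Hypothetical names throughout; a sketch, not a complete collation routine.
# Assumed alphabet order: a-z followed by ae-ligature, o-slash, a-ring (29 letters).
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_sort_key(word):
    """Return a collation key in which 'aa' and 'å' both rank as the 29th letter."""
    # Assumption: every 'aa' pair is the single letter, never two separate a's.
    folded = word.lower().replace("aa", "å")
    # Assumption: characters outside the alphabet sort after all letters.
    return [RANK.get(ch, len(RANK)) for ch in folded]


names = ["Aasen", "Berg", "Åsen", "Zahl", "Andersen"]
# The key is used only for ordering; each name keeps its own spelling.
sorted_names = sorted(names, key=norwegian_sort_key)
print(sorted_names)
```

Because the key is computed on the side, both renderings survive in the data and still collate together after "z". Names whose keys are identical keep their original relative order, since Python's sort is stable.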
12-Apr-88 12:14:23-EST,3118;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 12:14:18-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 12:12:36 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0518; Tue, 12 Apr 88 12:12:35 EDT Received: by BITNIC (Mailer X1.25) id 9299; Tue, 12 Apr 88 12:11:50 EDT Date: Tue, 12 Apr 88 10:25:07 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Code translation, device emulation, Notis, and Sorting To: Frank da Cruz Howard Gilbert corrects, quite rightly, my oversimplified discussion of the issues -- I will say that *within graphics strings* what is happening sure seems like code translation to me, but he is right that terminal emulation, data conversion, and so on are enough different that we need to keep the differences present in mind as we discuss them. His example of searching problems is a good one, but lest a false idea of Notis's capacities become widespread I should point out that at UIC we have no trouble finding Lodz (or &odz, as it appears in most of the records) by searching on 'lodz' -- our indexes never contain diacritics. Something other than Notis must be the problem. Jerry Bryan inquires about analogues to Norwegian's 'aa' and 'a' -- I know only of two. The 'ij' digraph in Dutch is sorted by itself, and the sharp s (esszett, w) of German sorts identically to 'ss', without being (in Germany) the same thing at all. (In Switzerland, sharp s is no longer used, and some Swiss refuse to believe me when I say it is still used in Germany and Austria.) But diacritics and umlauts also cause problems that no diddling with collation sequence can solve. 
Lists of words in French and German have, effectively, two sort keys: they are sorted first on the base characters without regard to the diacritics, and secondarily on the diacritics. (N.B. I am describing the practice taught me in class, and the practice I observe in dictionaries. Perhaps one of the European list members can say how diacritics are typically handled in data-processing sorts.) The concordance packages built for literary and linguistic study, therefore (e.g. WatCon and the Oxford Concordance Package) have special sort facilities to prepare sort keys for the sorting. But N.B. Howard Gilbert is quite right that sort sequence depends both on language and on country. Umlauts follow 'z' in Swedish and Modern Icelandic -- so books on Old Norse printed there sort them after 'z'. Books on Old Norse printed in England and North America tend to sort them either that way or in the German fashion. Same goes for edh and thorn. Michael Sperberg-McQueen, University of Illinois at Chicago 12-Apr-88 13:02:15-EST,4248;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 13:02:11-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 13:00:32 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0610; Tue, 12 Apr 88 13:00:30 EDT Received: by BITNIC (Mailer X1.25) id 0349; Tue, 12 Apr 88 12:58:48 EDT Date: Tue, 12 Apr 88 09:45:56 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: Re: Two more cents worth... To: Frank da Cruz In-Reply-To: Message of Mon, 11 Apr 88 22:27:00 PDT from I have no doubt you will catch a lot of flames from people who don't understand the issues, I for one think you are on the right track. 
As I see it the whole EBCDIC/ASCII/ISO8859 issue is really 4 closely related problems: 1) providing the means to input/output characters which are meaningful to the underlying host but are not available on the terminal, printer, etc. being used 2) inter-site communication (ISO8859 should be adopted as the standard code for all such communication) 3) ASCII-ISO8859 migration, while the people on this list may not be too concerned about this particular problem, some of us do have to 'push from the other side' as well. As yet, I don't know of any group which is pushing for UNIX support of ISO8859, for example. 4) EBCDIC-ISO8859 migration - I'll discuss this at some length. Suppose for a moment that EBCDIC code pages for each of the ISO8859 family of codes were adopted, the end result would be that IBMs would be using ISO8859 with a different collating sequence. We would also be saddled with the needless waste of 'translating' characters between the two indefinitely. I realize that there are MANY applications which currently use EBCDIC, and I am not proposing to simply scrap them. What I am proposing is that IBM provide a means for users to migrate at their own pace from EBCDIC to ISO8859. There will no doubt be those who say why should IBM switch to ISO8859 why doesn't everyone else switch to EBCDIC. The answer is three-fold. ISO8859 is a better code, it was designed for computers not inherited from TAB equipment. ISO8859 is an internationally accepted standard, it may not be perfect but everyone has agreed to use it. IBM itself is straddling the EBCDIC-ASCII divide (PC, PS/2, AIX). So, if we want a standard, ISO8859 is the one to pick. Few people are probably aware that when the 360 was first introduced bit 12 of the psw determined whether the machine was using ASCII or EBCDIC. I remember reading an interview with one of the developers a while ago in which he stated that this feature was dropped because 'no-one wanted it'. 
What a pity, if the business DP centers of 20 years ago had had a bit more foresight this whole mess might have been avoided. (Actually, I think the reason users didn't want IBM's ASCII support had to do with the way IBM defined ASCII - I came across an old copy of Principles of Operation which describes it. The characters themselves were pretty much as they are now, but the bits were laid out strangely: 76X54321 where X was 0 if the character was < @ and 1 if > ?.) IBM should introduce an EBCDIC/ISO8859 option. Eventually EBCDIC would then go the way of card readers and 7-track tapes, and future programmers would be able to marvel at how quaint this whole situation was. Even if it took 20 years, eventually we would be rid of the problem. If IBM does not migrate to ISO8859, how long will it be before there is a mailing list to discuss the problems of having 2 collating sequences? If IBM is going to migrate to ISO8859 then now is the time to start planning for it. A final question: is there any real benefit to having 2 distinct code families which I have overlooked? 12-Apr-88 13:28:02-EST,1883;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 13:27:55-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 13:26:10 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0654; Tue, 12 Apr 88 13:26:09 EDT Received: by BITNIC (Mailer X1.25) id 0461; Tue, 12 Apr 88 13:06:41 EDT Date: Tue, 12 Apr 88 13:00:07 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "John F. Chandler" Subject: Re: Single or Multiple Tables for Multiple Character Sets? To: Frank da Cruz In-Reply-To: U18189%UICVM.BITNET@CUVMA.COLUMBIA.EDU message of Mon, 11 Apr 88 10:56:34 CDT I don't subscribe to this discussion list, but I was sent a copy of this one posting. 
My perspective is two-fold: file transfer and connection of ASCII terminals to IBM mainframes. In a way, the 2nd is just a special case of the first -- there is a tremendous corpus of files that have been typed in over the years. I will restrain my skepticism for the moment and assume that a single standard can be (A) agreed upon in the present forum and (B) acted upon elsewhere. That leads to my main point: having gone through this whole argument in the context of CMS Kermit, I have come to the conclusion that, once a site settles on a single translation scheme, that scheme should be built into any and all file transfer mechanisms used there. Kermit, for example, offers a tailorable A-to-E table (on the mainframe side), which can embody any mapping you care to define. 12-Apr-88 14:15:11-EST,2194;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 14:15:06-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 14:13:28 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0808; Tue, 12 Apr 88 14:13:26 EDT Received: by BITNIC (Mailer X1.25) id 1187; Tue, 12 Apr 88 14:12:27 EDT Date: Tue, 12 Apr 88 13:39:46 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "Bryan, Jerry" Subject: Two more cents worth... To: Frank da Cruz In-Reply-To: Message of 04/12/88 at 09:45:56 from KESICH@NYUCIMSA >As I see it the whole EBCDIC/ASCII/ISO8859 issue is really 4 closely >related problems: .... text deleted.... > 2) inter-site communication (ISO8859 should be adopted as the standard > code for all such communication) This is a most interesting suggestion. It would mean, for example, and if I interpret it correctly, that two IBM EBCDIC machines communicating with each other would use ISO8859 rather than EBCDIC over the communications path. 
This sort of suggestion is philosophically in line with standards emerging in other areas where there is an interchange standard for graphics, for word-processing style text, for CAD/CAM drawings, etc., where the interchange standard does not dictate how the data is stored in the computer, so long as the machine can convert from its internal representation to the interchange standard and back. If this idea were carried far enough, it could possibly become the basis for a long term (30 year?) conversion plan to ISO8859 for everything. For example, one could view reading or writing to a tape or a disk as an "interchange", so one could read or write an ISO8859 tape or disk into an EBCDIC machine or vice versa with smart controllers performing an "interchange" rather than an I/O. 12-Apr-88 17:25:50-EST,1620;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 12 Apr 88 17:25:45-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 12 Apr 88 17:23:54 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1089; Tue, 12 Apr 88 17:23:50 EDT Received: by BITNIC (Mailer X1.25) id 3729; Tue, 12 Apr 88 17:23:03 EDT Date: Tue, 12 Apr 88 16:07:05 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: Re: Two more cents worth... To: Frank da Cruz In-Reply-To: Message of Tue, 12 Apr 88 13:39:46 EDT from >> 2) inter-site communication (ISO8859 should be adopted as the standard >> code for all such communication) > This is a most interesting suggestion. It would mean, for example, and > if I interpret it correctly, that two IBM EBCDIC machines communicating > with each other would use ISO8859 rather than EBCDIC over the communications > path. In theory, yes they would use ISO8859. 
In practice they could continue to use EBCDIC between them so long as all data being passed through the link between 3rd parties was correctly mapped from and then back to ISO8859. But backbone network nodes could avoid all the translation overhead by working strictly in ISO8859. 13-Apr-88 17:56:19-EST,1818;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 13 Apr 88 17:56:18-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 13 Apr 88 17:54:42 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2584; Wed, 13 Apr 88 17:54:39 EDT Received: by BITNIC (Mailer X1.25) id 7127; Wed, 13 Apr 88 17:53:41 EDT Date: Wed, 13 Apr 88 17:23:43 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "User Services, DCS Paul Henderson" Subject: Re: Two more cents worth... To: Frank da Cruz In-Reply-To: MAIL of Wed, 13 Apr 88 10:34:49 +0300 >To set the record straight, bit 12 of the PSW controlled the generation of the >sign in the results of decimal arithmetic computations. The bit was labelled >ANSI (rather than ASCII) because the "standard" plus sign was X'A' and the >minus sign X'B' rather than X'C' and X'D' respectively, which were IBM's At the risk of sounding like a nit-picker -- I just happen to have a Principles of Operation, Form A22-6821-7, dated September 1968. On page 71 it discusses the PSW: ASCII(A): When bit 12 of the PSW is one, the codes preferred for the USASCII-8 code are generated for decimal results. When the PSW is zero, the codes preferred for the extended binary-coded-decimal interchange code are generated. Perhaps the meaning of the bit was changed to ANSI before it was dropped but for some of us, it really was the ASCII bit. We never used it either. 
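The sign codes under discussion can be made concrete with a small sketch. It assumes the usual packed-decimal layout (one digit per four bits, with the sign in the low four bits of the last byte) and takes the sign values from the quoted message: X'A' and X'B' for the ASCII-8 preference, X'C' and X'D' for EBCDIC. The function `pack_decimal` and its `ascii_mode` flag are hypothetical stand-ins for PSW bit 12, written in Python purely for illustration.

```python
# Sign codes (plus, minus) keyed by the state of PSW bit 12.
# The values come from the quoted message; the names are hypothetical.
SIGN_CODES = {
    False: (0xC, 0xD),  # bit 12 = 0: EBCDIC-preferred signs
    True: (0xA, 0xB),   # bit 12 = 1: ASCII-8-preferred signs
}


def pack_decimal(value, ascii_mode=False):
    """Pack an integer as decimal digits followed by one sign code."""
    plus, minus = SIGN_CODES[ascii_mode]
    sign = minus if value < 0 else plus
    nibbles = [int(d) for d in str(abs(value))] + [sign]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)  # pad with a leading zero digit to fill whole bytes
    pairs = zip(nibbles[0::2], nibbles[1::2])
    return bytes((hi << 4) | lo for hi, lo in pairs)


print(pack_decimal(123).hex())          # sign code C
print(pack_decimal(-123, True).hex())   # sign code B
```

On this reading only the sign code in the result changes with the bit. The digits themselves are the same in both modes.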
14-Apr-88 09:33:27-EST,5133;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 14 Apr 88 09:33:24-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 14 Apr 88 09:31:51 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3213; Thu, 14 Apr 88 09:31:49 EDT Received: by BITNIC (Mailer X1.25) id 3982; Thu, 14 Apr 88 09:30:24 EDT Date: Thu, 14 Apr 88 14:55:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: national versions To: Frank da Cruz Dear list subscribers Let us stop discussing red herrings like "Throw out EBCDIC". If you ever have attended a SHARE or SEAS meeting you ought to know what is needed for putting a requirement to IBM. Let us also stop using imprecise terminology, like ISO8859 without further qualification, or ASCII meaning an 8-bit code. Because of this Mr. Kesich's letter is incomprehensible to me. There is no single ISO8859, and ASCII is not identical with ISO646, both being 7-bit codes. 8-bit ASCII does not exist under this name. The real problem for both EBCDIC and ISO is that of the unique graphic- code correspondence. Risking to bore those who know, some little tutorial is appropriate. First "ASCII". ISO646 (1st ed. 1973, 2nd ed. 1983) specifies 7-bit codes for characters. Of the 128 possible codes 33 are for control characters, one for SPACE, and 94 for graphic characters. Of this last group 82 are unique, 12 are left open. For completing the set defining a "national version" of ISO646 is required. An International Reference Version (IRV) is provided where none is preferred. ASCII is simply the US National Version of ISO646. It differs only from IRV in having a $ instead of the currency sign. But German, Swedish, Danish/Norwegian and others substitute accented letters at most of the 12 places. 
What does this mean? If you send square brackets to Norway, they arrive as AE and A-ring (braces as ae and a-ring). This practice puts a barrier between the English and the non-English speaking world, caused by the number of characters being limited to 94.

An 8-bit code could be a solution, by adding 96 codes. But even then, not every character can be accommodated in a unique way. ISO 8859-1 provides 190 unique characters used in Western Europe. This shifts the barrier to about the Iron Curtain, leaving Greece and Turkey on the wrong side. Now, if you send a Turkish text from Ankara to Washington, it arrives with Icelandic eths and thorns inserted into the Turkish words. If you are not too concerned about excluding a NATO member from the Western civilisation, ISO8859-1 is certainly an improvement.

There is a next step, but at a price. If we take two bytes for every character we can accommodate much more, even Chinese. Only a few mandarins who happen to know 80000 Chinese characters would be disappointed. The design of this is the subject of SC2/WG2, meeting this week in Boston. Thus, at some time, there may be an ISO standard for it. I am interested to hear opinions on this idea.

As for EBCDIC, the situation at present is comparable, only there are 14 positions available for national versions. You find the story in "3270 system - Display and Printer I/O Interface Codes", Figure 10-43. Still worse is the collection of horrors in "IBM Displaywriter Host Attach Programming Guide", pp. 5-3 to 5-33. It also includes the EBCDIC/7-bit code correspondence. A distinction between EBCDIC/Multilingual, EBCDIC/DP and EBCDIC/WP is also mentioned there.

So much for the tutorial. If we want the order of things changed, we should first know what it is. But there is an important difference when attempting changes. The ISO standards are produced by international Working Groups, and approved by Subcommittees and Technical Committees, National Member Bodies voting.
But in many countries it is not difficult to get into their panels, provided that you know your stuff, and are prepared to do a lot of work, and attend the meetings. With EBCDIC, changes are an IBM management decision that can be influenced only to a certain extent by SHARE, SEAS or other groups' requests, and often at a stage that is too late. Even the defining document for EBCDIC is hard to obtain (it exists, it is IBM Corporate Standard, CSS 3-3220 002, that is to say my copy that dates from 1970, and will certainly have been modified since then).

So, we should not only talk about what to agree on, but also about the way to achieve it. I hope that my contribution to our list has been constructive. Let us shed our tears elsewhere.

Yours faithfully, Johan van Wingen
FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands :

14-Apr-88 12:17:59-EST,4482;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 14 Apr 88 12:17:53-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 14 Apr 88 12:16:17 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3540; Thu, 14 Apr 88 12:16:13 EDT Received: by BITNIC (Mailer X1.25) id 6936; Thu, 14 Apr 88 12:12:39 EDT Date: Thu, 14 Apr 88 09:27:22 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Query about implementation and use of ISO standards To: Frank da Cruz

In discussions of character sets, I occasionally encounter people (in person or by their writings) who wonder what all the fuss is about, since after all ISO 2022 defines a perfectly adequate method of switching from one seven-bit character set to another (SO) and back (SI) -- and moreover also defines methods for specifying, as part of the data stream, which character sets are being used as G0 and G1
sets. (By means of escape sequences assigned by ECMA, acting for ISO as a registration authority, to all registered sets.) The obvious reason for the fuss, of course, is that ISO 2022 defines a method, but does not provide the hardware or software to implement the method. And I wonder -- do the hardware and software exist? So my questions are these: 1 how common (in the experience of this group, either in North America or elsewhere) are terminals or terminal emulation programs which accept and handle SI/SO character set switching? I can think of: - Datamedia APL terminals (I have read that most ASCII APL terminals do SI/SO, but I have never encountered any but DM) - IBM 3163, 3164 terminals, with or without the ALA cartridges - Yterm (beginning with version 1.3) and that's it for me. Are there others? 2 how many of these terminals can handle G1 graphics other than the built in set? In my limited experience, only two: the IBM 3163/4 and PCs -- if they have EGA or Hercules Graphics Plus or Quadvue cards. (Or any PS/2 with a VGA.) 3 how many devices of any type can choose the character set they use on the basis of the registered escape sequence for a set? I don't know of any at all. Is that supposed to be what happens with ISO 2022, or is it intended that software somewhere along the way will see the registered escape sequences and translate them into control sequences that will set the terminals or printers correctly? If the recognition of registered escape sequences is supposed to happen in software, then has anyone ever written, used, seen, or heard of software that does this? 4 (forgive my ignorance, I have used mostly IBM mainframes) Do the file structures and utilities of ASCII operating systems and editors always / usually / sometimes / ever allow escape sequences like those prescribed by ISO 2022 to be embedded in files? Or will the communications link see the escape sequence when it comes in from a terminal, try unsuccessfully to parse it, and discard it? 
For that matter, can ASCII systems embed the SI and SO in the file? (Or IBM systems? Yes, I know about SET HEX ON and ALTER in Xedit, but are there simpler ways?) In sum -- my own experience is that SI and SO are useful and (now) possible, between a mainframe host and a terminal where both know in advance what character sets are to be used. I have now seen this convention actually used in terminal-to-host communication, this year for the first time (long after first reading about it). But -- while it seems equally useful to be able to identify character sets by the use of escape sequences embedded in the data stream, I have still (fifteen years or more after ISO 2022) never seen in use or heard of as ever being used. Is it used? Or is it a nice idea that no one has implemented, as ISO 6937/2 appears to be? Since there seems no point in burdening the list with replies saying "Nope, I haven't ever seen it either," replies can be sent to me, U18189 at UICVM, and I will post results, if any, to the list. Michael Sperberg-McQueen, University of Illinois at Chicago 14-Apr-88 12:46:25-EST,4870;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 14 Apr 88 12:46:19-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 14 Apr 88 12:44:42 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3584; Thu, 14 Apr 88 12:44:40 EDT Received: by BITNIC (Mailer X1.25) id 7306; Thu, 14 Apr 88 12:43:19 EDT Date: Thu, 14 Apr 88 11:24:08 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: national versions X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz I agree with Johan's comment, but want to add one observation and reinforce another. 
1) Changing EBCDIC: By and large, the way IBM makes decisions like this is dictated by "marketing considerations" much more than by, e.g., requests from the likes of SHARE or SEAS. Since the COBOL debacle and the environment that spawned it ("we have these three zillion lines of code that we would have to change, and that would cost us umpity-ump kilos of gold and pounds of flesh..."), "marketing considerations" has often been an abbreviation for "we are just not going to make incompatible changes if they are going to disrupt our installed customer base". That is, I want to stress, a reasonable position, but it makes "drop EBCDIC internally" about as realistic as (maybe less so than) "it would be really nice if the 370 supported a hardware stack architecture". The whatever-you-like-internally, Standard-character-sets-in-interchange approach is actually realistic, but this is not the right place to debate it. Things are moving in that direction anyway, but, if you are going to have that plan, then you are going to need to convert the graphics. Somehow. Which is where this discussion should probably focus.

That said, the problem is a little worse than Johan's description. First of all, the national member body representatives of some of the countries with non-alphabetic languages stood up at an ISO/IEC JTC1/SC22 (programming languages) meeting last fall and indicated, among other things, that "multiple byte" might well need to be more than two. Second, they want the multiple byte sequences *embedded* in single-byte sequences and vice versa, and got an SC22 vote imposing support for exactly that requirement on any future programming language standardization. They appear to feel that translation from a data stream that has both sets of characters and escapes into an internal representation that uses a single (adequately long) length is unacceptable, or only marginally acceptable - at least in part because of the space requirements.
Consider the programming language implications of the usual "how long is that string in 'characters'" and "are these two strings equal" operations. Please do not start a discussion on this topic here, just think about it as part of the background to any "solutions" you propose. Now, the other thing that has slipped past in the flurry of messages is that there are ISO standards finished or under development that permit switching character sets midstream. I can, in principle, send out a stream of characters in ISO8859/1 (Latin alphabet 1) and insert a control sequence somewhere that says "here comes ISO8859/8", and then send some characters which I want interpreted according to the latter set of graphic mappings. Then I can switch back, or switch to a third registration set. Now, one can perfectly well design a system that responds to those "switch sets" controls with "I can't deal with that nonsense", or one can be prepared to handle all of them. But, if you take the first position, you are better off than you were with national variants of ISO646 only in that you *know* that you can't interpret the characters correctly, rather than thinking that they represent your own variant. But, if you decide to cope with a character set switch, then you need to worry about the EBCDIC code pages, or other variations, to deal with the entire range, and how you are going to switch between them (so much for "one network, one conversion standard" or even "one host, one conversion standard", at least for simple versions of "conversion standard"). Again, this is not an attempt to send people off in another irrelevant direction -- I would strongly discourage that -- but let's also skip the simplistic "solutions". They either won't work now, or won't work for very long. 
15-Apr-88 14:34:57-EST,3296;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 15 Apr 88 14:34:54-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 15 Apr 88 14:33:07 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0759; Fri, 15 Apr 88 14:33:03 EDT Received: by BITNIC (Mailer X1.25) id 8865; Fri, 15 Apr 88 14:32:21 EDT Date: Fri, 15 Apr 88 08:54:55 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Re: national versions X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz > >In-Reply-To: Message of Thu, 14 Apr 88 14:55:00 MET from > > I think you are right about the two-byte international char. set. >The world isn't as big as it used to be, because of the xxxNETs. So we >need to have ONE code, having ALL chars of the world in it. Problems >will be solveable by mapping existing producer-dependent charsets into >this code when xfering files. Ensuring correct printout is depending on >printers/printer software. Sigh. Let's assume that you can make a list of "ALL chars of the world". Let's assume that you can get a list of the "important" characters in the non-alphabetic languages (Chinese and Japanese Kanji are not the only ones, just the ones you hear about most often) and get the people who use those languages to agree to never want to add another character (which would require an extensible set, which works against "ONE code"). Those assumptions are pretty unlikely to be true, but, just assume. Then your only problem is that the nature of the standardization process is that it is likely to be well into the next century before that character set can be agreed upon. 
There are a number of character sets for which, as far as I know, there aren't even coding proposals in the international arena (Sanskrit and Thai come to mind -- if those proposals exist, they haven't crossed my desk when I was looking). And, if you want "ALL characters of the world", you need to worry about some languages that are no longer in common conversational use, since scholars in the relevant fields want to communicate with each other - anyone for an ISO-standard Phoenician character set? Etruscan, perhaps? I think one's choice is to learn to deal with an extensible system, and hence multiple characters sets, today (or soon), or to theorize and harmonize for a *very* long time. I think I prefer the former. Also note that CCITT IA5, otherwise known as ISO646 Basic Version, was an attempt at a "universal" character set, and works fairly well in restricted applications, as does the even-more-restricted Telex character set. But it does not do very well for non-Latin alphabets or lots of characters in highly-populated Latin-derived ones, which is what this discussion is all about. 15-Apr-88 14:43:52-EST,3433;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 15 Apr 88 14:43:41-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 15 Apr 88 14:41:31 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0776; Fri, 15 Apr 88 14:41:29 EDT Received: by BITNIC (Mailer X1.25) id 8991; Fri, 15 Apr 88 14:40:27 EDT Date: Fri, 15 Apr 88 10:25:00 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Rick Troth Subject: Throw out EBCDIC? To: Frank da Cruz Thank you Jon Klensen for "right on the money" comments. Indeed IBM is obliged to keep a large customer base happy. 
Banks (for example) don't really care about what's new or what's happening in China; they just want what works. "What we have now is fine ... why change?" IBM will always be slow to change because most of their customers are slow to change. We should make the change as smooth as possible.

I am becoming an IBM bigot myself, but there was a time when I hated EBCDIC for incompatibility. Then came 7171's, VAXen on BITNET, and ... whoooa ... we're actually making progress here. Amdahl seems to recognize the value of ASCII (ISO) and the whole idea of a more general I/O scheme. Their version of UNIX, UTS, is quite an excellent implementation. I was quite astonished to discover that it is an ASCII system. But behold: one need not give up 3270's, RSCS (even NETDATA), or VM. JNET (for the VAX) and UTS are both making inroads for ASCII (and then ISO) into the IBM mainframe world. (Actually JNET is making an inroad for EBCDIC into the VAX world :-)

Since I brought up DEC at this time, I will post my "report card" on the VT220. Having gone over the white paper from Ed Hart, I compared the listing of ISO8859/1 to the "DEC Multinational" character set. Multinational diverged from 8859/1 in 15 places: five of those were collisions where DEC had defined something different from ISO, and ten were left blank in the DEC definition. Jon mentioned switching character sets mid-stream. The VT200's can do that. They can also handle SI/SO if you modify APL support on your 7171. There are almost a dozen different "NRC sets" in the box. I am not as enamored of DEC as I was once, but we have a lot of VAXen on campus and thus have a lot of VT200's. I'd like one in my office.

Since I mentioned UNIX (somebody on IBM-MAIN just royally flamed UNIX), the ideas that "everything is a file" and "all I/O is performed via device drivers" are good. MVS does this (to an extent); CMS does not, but it could.
I don't care for the idea of a two-byte character set, but I do like the concept of mid-stream (transparent to the user) set switching. Device driven I/O can handle that quite well and makes the transition (to ISO or whatever) much smoother. This is precisely how UTS as an ASCII system can work just fine with EBCDIC 3270 tubes. I am really quite impressed; you really should all see it. (mercy! I don't mean to advertise for Amdahl) - Rick 15-Apr-88 16:13:01-EST,1116;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 15 Apr 88 16:12:53-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 15 Apr 88 16:10:47 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0974; Fri, 15 Apr 88 16:10:46 EDT Received: by BITNIC (Mailer X1.25) id 0919; Fri, 15 Apr 88 16:09:56 EDT Date: Fri, 15 Apr 88 15:53:34 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Throw out EBCDIC? X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz Small addendum to Rick's note: The DEC VT300 has support for Latin Alphabet 1. The 200 predates it, and "DEC Multinational" was, approximately, a best guess. 
15-Apr-88 19:20:02-EST,17835;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 15 Apr 88 19:19:57-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 15 Apr 88 19:18:24 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1239; Fri, 15 Apr 88 19:18:22 EDT Received: by BITNIC (Mailer X1.25) id 3169; Fri, 15 Apr 88 19:11:48 EDT Date: Fri, 15 Apr 88 18:55:26 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues Comments: Code: CECP 500 From: Otto Stolz +49 7531 88 2645 Subject: Some clarifications To: Frank da Cruz

Dear list subscribers,

first, let me tie together a few loose ends of the discussion that commenced on 15 Mar 88 with "Some Important Comments from Howard Gilbert". I hasten to disclaim: I'm not the Network Expert of our site; rather, my duties relate to the end-user interface (software & advice). Hence, the opinions and proposals presented below are my private contribution and bear no official character.

IBM's CECPs
-----------

Further to the ISO 8859-1 standard of 1987, IBM changed their Graphic Character Set 00697 to conform with the character set of the ISO standard. To do so, they had to replace these 4 (four) characters:

    SC07   florin (guilder) sign
    SM10   double underline
    LI61   small letter dotless i
    SP31   numeric space

with those four characters:

    SM52   copyright sign
    SA07   multiplication sign
    ND011  one superscript
    SA06   division sign

Please make sure that the tables you use are dated 1987 or later, and contain the new character set.

Moreover, IBM has defined 9 (nine) Country Extended Code Pages (serving 17 languages), which contain the characters of GCS 00697 in various permutations. Again, you should make sure that you use the new CECPs.

Clearly, IBM had two aims in mind:
1.
provide an unambiguous mapping between any pair of CECPs, and possibly between any CECP and ISO 8859-1;
2. avoid data conversion at their customers' sites when they switch over to the new CECPs.

This latter aim prevented the introduction of a single CECP: we all now have to live with the consequences of the mistake IBM (and ISO, by the way) made decades ago, when they started that "National Characters" rubbish in their code pages. We call this a "Treppenwitz der Weltgeschichte", in German.

Characters convey a meaning
---------------------------

ISO 8859-1 and the CECPs deal with coding of characters into bytes, hence the only sensible mapping between them is via matching characters (i.e. graphics or character descriptions, e.g. "small letter a with grave accent" of ISO 8859-1 is mapped on "LA13 a Grave Small" of the CECPs). It's a pity that IBM and ISO differ even in the wording of their respective descriptions, but I guess everybody can live with that.

This mapping would allow the transfer of notes, scripts and the like between users in all countries speaking one of these 17 languages. As an aside, letters with diacritical marks, and German and Icelandic National Letters, are vital for these respective languages. They are not just "fancy characters", but rather letters in their own right. Recently, I've seen in a Swiss newspaper an amusing example of the habit of using "ss" instead of the Sharp-s "ß":

    Brigitte Bardot mit ihren beachtlichen Körpermassen
    (BB, and the considerable masses of her body)

whilst the writer probably intended to say

    Brigitte Bardot mit ihren beachtlichen Körpermaßen
    (BB, and her remarkable anatomical measurements).

O yes, Michael Sperberg-McQueen, most Germans and Austrians cannot imagine how the Swiss can do without Sharp-s.

With program sources, things are similar, but a bit more complicated.
Program sources are written by human beings, and they are in this world to be read by human beings! (As a pleasant accompanying phenomenon, they can also be obeyed by computers.) If it were the other way round, we all would enter our programs bitwise, in machine language. Hence, program sources must look alike in books, on screens, in listings, and on the keyboard. That's the reason programming language standards specify characters, and do not (and should not) specify code points (cf. Howard Gilbert's remark). On the other hand, they should (and normally do) specify alternative representations to take account of limited character sets (not Code pages!), e.g. "(*" for "{" in Pascal. And after introducing any ISO 8859 character set, compilers should cease using characters for wrong meanings, e.g. tilde or circumflex accent for not-sign.

IBM falls far short of the goal of using characters sensibly: without being ashamed in the least, they sell you equipment for a couple of Mega-DMarks (a terminal, a control unit, a computer, an operating system and a compiler) which is not capable of translating a Pascal program, even as simple as

    PROGRAM ebcdic (output)
      (* example of a little Pascal program *) ;
    BEGIN
      writeln ( 'hello!' )
    END .

just because you happen to live in Germany, where the word for "of" contains a letter(!) that is interpreted by the compiler as an end-of-comment-symbol! :-(

Clearly, the next step to be required from IBM must be adapting their language processors to the CECPs. Recognizing dual EBCDIC codes for some characters is not enough for the compilers and other applications: as long as there are various EBCDICs (call them CECPs or what you want), you must be able to customize them for the variant to be used! Folks, please help convince IBM by sending in as many APARs as you have products. The same holds for other software suppliers.

But now, for the difference between plain text and programs.
In addition to using characters, you may also refer to them. This is no problem in plain text (cf. the sharp-s example, above) -- but in programs you normally use its code point to refer to a character! And here the Code Page creeps in again: if you are going to convert a source program from one code to another, you are doomed to understand its ends and means to a T. Every number (be it hexadecimal or decimal) might well be a character code, or a character code offset, or whatever you can imagine. Hence, automatic (and reliable) code conversion of program sources is virtually impossible.

Example: you read in a Pascal program

     'a'-'z', '', '>', 'u', '' !

From your knowledge that this program comes from an ISO computer in Germany, you have to infer the meaning "some small letter", and you have to translate it into something like

     'a'-'z', ''-'' ! (* for ISO 8859-1*)

or

     'a'-' ', ')'-'', 'W', 'a'-'i' , 's'-'.', 'j'-'r', 'N' , 's'-'z', 'J'-'m', '-'-'' ! (* for CECP 500 *)

and similar (but different) for other CECPs. Now, you probably understand IBM's reluctance to adopt a single universal CECP.

FORMER CODES
------------

The CECPs meet a host (80 to 300, depending on whom you ask) of more or less established codes and practices.

1. There is such a thing as The Factual Software Code: though no standard (neither ISO, nor national, nor internal) covers this practice, software designers seem to unanimously take English (U.S.) EBCDIC plus TN-style brackets minus OCR characters for "the" EBCDIC. A couple of months ago, I met an IBM employee who is substantially involved in Codes and Keyboards design. When I told him that the brackets are normally assumed on code points AD & BD, he exclaimed: "But they never belonged there!" Then I told him that even IBM's Pascal/VS compiler accepts only(!) AD and BD for the brackets. He had never heard of such a thing! (Boys, I'm not kidding; that really has happened here.)
As I guess from the recent discussion in this list, the BITNET-Code (if there is such a thing) probably looks very similar. The character set of this Code is too limited to support any other language than English. So, any CECP would be an improvement. I do not believe that IBM might be willing to define a tenth CECP, based on this code (for which not even a standard exists).

2. IBM's I/O Interface Codes are selected during control unit customization. The trouble: by choosing some keyboard layout, you implicitly choose an I/O Interface Code. Example: if you choose German Keyboard, you get the "Ü" (capital U with diaeresis) on the very codepoint US EBCDIC uses for the exclamation point. Hence, important sentences in notes from abroad, and in every IBM-supplied help text, are marked for us, inevitably, with U-Umlaut. Note that every IBM 327x (or similar) screen is capable of displaying all letters required for 16 languages (anything except Icelandic) and a lot of special characters. It's the control unit that prevents you from seeing these characters -- or allows you to display them, if you have installed Configuration Support C, D or T. In fact, every Configuration Support establishes its own EBCDIC variant; hence, the nine CECPs cannot be truly upwards-compatible. Matthias Melcher's suggestion is based on these Configuration Supports.

3. The 7171 Control Unit seems to be based on a code similar to 1., above. The trouble here is that any character which is not in the code translation table of this ingenious device is translated into a colon ":". Hence, you can only get about 90 different characters through this bottle-neck, when you need about 190 different ones. The 7171 manual states that this translation table can be amended. Has anybody done this, so far? If so, please drop me a note stating your experiences!

4. Kermit has its own ideas on ASCII-EBCDIC translation. (Very similar to 1., above.)
During Terminal Emulation, it's confined to 7171's limitations (at least in our case, where the PCs are connected via a 7171). For File Transfer, Kermit can be customized by a suitable take file; so at least in this area incompatibilities can be solved.

5. IBM PCs and clones have used their own 8-bit character set, comprising the national letters of the same 16 languages but different special characters. All PCs, regardless of their keyboard, use identical codes for this character set (that's the purpose of that keyboard program you get loaded when you boot-strap your PC). Pity that not all software designers recognize this scheme. Notably, terminal emulation programs tend to bypass the keyboard program and hence are useless outside the USA. The same tends to hold for software designed for multiple computer brands. I guess IBM has already started delivering PCs with a new character set, a superset of ISO 8859-1 (they kept 16 classical PC characters, most of them semi-graphics). The code is ISO 8859-1 (i.e. every codepoint above A0 is re-assigned) plus the additional characters in codepoints 80 to 9F.

WHAT CAN BITNET DO?
-------------------

BITNET is primarily designed for transferring messages, i.e. plain text. Let's set a comparatively humble goal, for the moment: BITNET should transmit any plain text consisting of characters from the ISO 8859-1 character set (i.e. GCS 00697) sensibly and undisturbed.

This must be our first goal, leaving out
* special handling of program sources (cf. remarks above),
* other latin based alphabets,
* non-latin left-to-right single-byte coded languages (e.g. Greek),
* right-to-left languages, and
* double byte coded languages.

Program sources require human intervention for a thorough, sensible translation (and they must be enabled for that purpose, cf. IBM's "National Language Information and Design Guide" series, SE09-8001, SE09-8002, ...) The other four require special equipment.
Throughout BITNET, English seems to take the role of a Lingua Franca; hence even participants in non-latin-writing countries will have to use a latin-writing terminal for their BITNET correspondence.

BITNET is still far away from even this most moderate goal; nor does it handle the former codes sensibly. One more example: Germany's primary BITNET node, DEARN, refused to accept UDS entries or list subscriptions containing German Sharp-s or German Umlauts in the participant's proper name. The subscribers had to substitute other characters (e.g. "oe" or even "-") for such characters in their names. That happened in a country where you are legally entitled (96 BDSG, 96 LDSG) to have your name's spelling corrected if it's mis-spelled in any database!

As stated earlier, the transition from some 300 different EBCDIC variants to 9 EBCDIC + 1 ISO would be a major improvement.

HOW COULD IT WORK?
------------------

I suppose that every site will try to introduce one single CECP (or ISO 8859), and do away with the old Codetable mishmash. This will take time, as there is new equipment involved (new terminals, 3174 instead of 3274, updated compilers, &c.) Also, the old data and programs will have to be transformed, suitably. Note that BITNET is only a small part of the whole EDP business!

During the transition phase, MM's proposal could help smooth things. But behold, SET INPUT and SET OUTPUT cannot be the last word. These commands are only available in CMS; and they take effect only in CMS line-mode and in XEDIT. CP commands and CP messages are not translated, and most full-screen mode programs do not honour the SET INPUT and SET OUTPUT commands. What a pleasure, when you enter

    TELL Kurt Grüß Gott!

and CP displays to Kurt the Message

    Gr,( Gott|

After having chosen a CECP (or ISO8859-1), the site could send out its texts in this local code. They will have to be code marked: in the tag, and preferably also inside the text.
For NOTES, RFC822 could be enhanced with a "Code:" field, such as I have used in the header of this very note. I think, there are enough Network Experts listening to this list: they should be able to design a suitable amendment to the network's standards.

The price for sending out the notes in a local code variant (well, that's the very procedure, most sites are following right now) will be the obligation of translating incoming messages. So, every site will use at most 9 (of the possible 90) translation tables. Again, this could be done via SET INPUT and SET OUTPUT, as MM suggested (that's exactly the way, I read notes from USA and elsewhere). Later, the mailer, or RSCS, or some similar software piece, would do the translating for all incoming files, and the end-user will cease bothering with the details.

There will be need for a special marking, say "Code: Binary" preventing the file from being translated, at all. (André PIRARD should be able to continue sending his files through the net.)

HOW CAN WE SIMPLIFY CODE TABLE HANDLING?
----------------------------------------

Divide & impera! Instead of devising 90 related Code Tables (and painstakingly checking them for consistency), we could write down 10 "Half-Tables". These latter would relate one code page, respectively, to a common description of the characters. From these Half-Tables, a simple program could build the translation tables for every desired code-pair.

The ISO 8859 descriptions of the characters are a bit too long to make for a feasible common base of our half-tables. But, what about IBM's character identifiers, accompanying the GCS and CECP tables? Instead of "small letter a", we could use "LA01"; instead of "small letter a with grave accent" we say "LA13", and instead of "small diphthong a with e", we have "LA51".
Thus, the upper part of the CECP 500 half-table would be:

     |  4    5    6    7    8    9    A    B    C    D    E    F
-----+-------------------------------------------------------------
  0  | SP01 SM03 SP10 LO61 LO62 SM19 SM17 SC04 SM11 SM14 SM07 ND10
  1  | SA06 LE11 SP12 LE12 LA01 LJ01 SD19 SC02 LA02 LJ02 SP31 ND01

Note the new CECP 500, having SA06 (Division Sign) instead of the older SP30 (Numeric Space).

Aside: these identifiers start with the following letter(s):
   L  for Letter,
   ND for Numerical Digit,
   NF for Numerical Fraction,
   SA for Arithmetical sign,
   SC for Currency sign,
   SD for Diacritical mark,
   SP for Punctuation marks,
   SM for Miscellaneous special characters.

I can also set up a Half-Table for my control-unit's I/O interface code, hence a simple program could generate from these two half-tables an EXEC with suitable SET INPUT and SET OUTPUT commands to display CECP 500 on my terminal -- simply by matching the SM11 entry in code-point C0 (CECP 500) with the SM11 entry in code-point 75 (Austrian/German I/O Interface Code) and generating the REXX-line

   "SET OUTPUT C0 2"; "SET INPUT 2 C0" /* SM11 */

from this match. (The "2" byte comes out as opening brace, on my terminal.)

Thus, the generating of MM's procedures can be mechanized in the same way as the generating of the BITNET translation tables. The same holds for Kermit's Take-Files.

I would appreciate any comments on this proposition.

Regards
Otto.
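The half-table matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CECP 500 entries are the two rows quoted in the message, while `LOCAL_HALF` is a hypothetical fragment of the I/O interface half-table holding just the one entry mentioned in the text (SM11 at code point 75). The generated lines write both operands in hex; the exact operand form that CMS SET INPUT and SET OUTPUT expect is not checked here.

```python
def half_table_from_rows(rows, high_nibbles):
    """Build {code_point: identifier} from rows indexed by the low nibble."""
    table = {}
    for low, idents in rows.items():
        for high, ident in zip(high_nibbles, idents.split()):
            table[(high << 4) | low] = ident
    return table


# The two rows of the CECP 500 half-table quoted in the message
# (columns are the high nibbles 4..F, rows the low nibbles 0 and 1).
CECP500_HALF = half_table_from_rows(
    {
        0x0: "SP01 SM03 SP10 LO61 LO62 SM19 SM17 SC04 SM11 SM14 SM07 ND10",
        0x1: "SA06 LE11 SP12 LE12 LA01 LJ01 SD19 SC02 LA02 LJ02 SP31 ND01",
    },
    range(0x4, 0x10),
)

# Hypothetical fragment of a local I/O interface half-table: only the
# single entry named in the text. A real table would list every code point.
LOCAL_HALF = {0x75: "SM11"}


def build_translation(src_half, dst_half):
    """Pair code points of two half-tables that carry the same identifier."""
    by_ident = {ident: code for code, ident in dst_half.items()}
    return {
        code: by_ident[ident]
        for code, ident in src_half.items()
        if ident in by_ident
    }


def set_commands(src_half, dst_half):
    """Emit one REXX-style line per matched pair (operand form is assumed)."""
    lines = []
    for src, dst in sorted(build_translation(src_half, dst_half).items()):
        lines.append(
            f'"SET OUTPUT {src:02X} {dst:02X}"; '
            f'"SET INPUT {dst:02X} {src:02X}" /* {src_half[src]} */'
        )
    return lines


if __name__ == "__main__":
    for line in set_commands(CECP500_HALF, LOCAL_HALF):
        print(line)
```

With ten half-tables, the same `build_translation` call would cover every one of the 90 ordered code pairs, so consistency only has to be checked once per half-table.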
17-Apr-88 12:39:01-EST,4010;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Sun 17 Apr 88 12:38:59-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Sun, 17 Apr 88 12:37:21 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2123; Sun, 17 Apr 88 12:37:18 EDT Received: by BITNIC (Mailer X1.25) id 0264; Sun, 17 Apr 88 12:36:31 EDT Date: Sun, 17 Apr 88 02:33:42 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Some clarifications X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz Otto, Well-presented, clear, and focused on what I think are the right set of problems. One comment/plea: While the local system's controller may be the "right" place to establish the default code table binding for various supported devices, keep in mind that transformations are not guaranteed to be reversible at least in going to the large [historical] collection of existing devices. It is possible that I will have a German-capable (i.e., supporting German "extended" characters -- those that don't appear in ASCII / English by code table switching) or a German-national (i.e., supporting those characters *instead* of the ASCII special characters) -- available at a site that mostly has only US-national (i.e., ASCII-only) devices. If I do, I want the ability to see exactly what you write if you write to me in German, not the local interpretation of what German ought to be spelled like in IA5/ISO646 Basic version. I might even want to invent, for my own use, a set of two-or-three-character conventions. 
For, if you send me that word which we translate to English as "for", and I don't have lowercase-u-umlaut available, I might prefer that my smart terminal show me one of those "programming language" convolutions, such as fu:r or fu..r, rather than trying either fuer or fur, or showing me f|r (that is 'f', broken-vertical-bar, 'r' on my device at the moment). I would not expect to transmit this sort of notation convention, or expect anyone else to read it, but it is important that the exact text of what was sent, and the information about how it was encoded, be available to the end user's mail-reading program or agent. While you excluded the cases, the underlying problem becomes much more important for messages that might involve non-Latin alphabets: A reasonable site default might be to have them rendered into Latinized transliteration (there are even ISO standards for the Latin alphabet representation of several non-Latin alphabets). But a local user with the right equipment would, presumably want to see whatever the message looked like to the person who typed it. And don't hope for changes in RFC822 for several reasons. The most important is that local modifications and extensions made by various people that treat the header fields just as slightly-structured free text comments already have made it very difficult to build an adequate processing agent. The introduction of "Code:" is useful, beyond a warning that I'm not going to be able to read what follows, only if the predicate comes from a very restricted vocabulary and is arranged so that an agent can process the text as it specifies (as your discussion implies). X.400, by contrast, has provision for such a field. But the "red book" version doesn't have eight-bit character sets: unless you want to specify, e.g., Teletext encoding, you will find yourself limited to IA5-text. And CCITT IA5 = ISO646 = the restricted character sets from which our current problems originated.
john 18-Apr-88 08:53:24-EST,1241;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 18 Apr 88 08:53:20-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 18 Apr 88 08:50:49 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2675; Mon, 18 Apr 88 08:50:40 EDT Received: by BITNIC (Mailer X1.25) id 4044; Mon, 18 Apr 88 08:47:28 EDT Date: Mon, 18 Apr 88 08:41:57 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "Thomas D. Denier" Subject: IBM Graphic character identifiers To: Frank da Cruz Otto Stolz states that an initial letter 'L' in an IBM character identifier stands for 'letter'. It actually stands for 'Latin alphabetic'. IBM has assigned other initial letters to other alphabets, as follows: A Arabic G Greek H Hebrew J Katakana K Cyrillic Thus, for example, Latin lower-case 'a' is LA01, and Greek lower-case alpha is GA01. 18-Apr-88 12:54:50-EST,1720;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 18 Apr 88 12:54:47-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 18 Apr 88 12:53:01 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3050; Mon, 18 Apr 88 12:52:59 EDT Received: by BITNIC (Mailer X1.25) id 6469; Mon, 18 Apr 88 12:52:13 EDT Date: Mon, 18 Apr 88 12:05:56 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: Re: IBM Graphic character identifiers To: Frank da Cruz In-Reply-To: Message of Mon, 18 Apr 88 08:41:57 EDT from How does IBM's designation differ from ISO 6937/2 (Coded character sets for text communication - Part 2: Latin alphabetic and non-alphabetic graphic characters)? 
I know this is wishful thinking but could they actually be one and the same (or at least 1 a subset of the other)? As for the notion of adding a "code" field to mail headers, I don't think it would buy you very much even if it were implemented. Two problems suggest themselves: 1) what about non-mail transmissions 2) what about shifting to other codes within the text (How does IBM provide for shifting from 1 code page to another?) There are already ISO standards which allow you to shift from code set to code set to your heart's delight - why reinvent the wheel? (646, 2022, 4873, 8859) 18-Apr-88 15:07:01-EST,2477;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 18 Apr 88 15:06:56-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 18 Apr 88 15:04:31 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3273; Mon, 18 Apr 88 15:04:24 EDT Received: by BITNIC (Mailer X1.25) id 7894; Mon, 18 Apr 88 15:03:47 EDT Date: Mon, 18 Apr 88 13:11:38 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Header field for CODE and reinvention of wheel To: Frank da Cruz John Kesich asks why a Code: field is necessary, since ISO has already standardized methods for shifting between character sets. Perhaps a preliminary report on answers to my earlier query about ISO 2022 implementations is in order. The reason "Code:" would be useful, and might be necessary, is that ISO 2022 code-page switching via SI/SO is possible only when the G0 and G1 (and C0 and C1, for that matter) sets are known in advance to all parties. ISO standards for identifying coded character sets by means of registered escape sequences have no known implementation in any automatic device. 
(Possible exception: some printers may accept the registered escape sequences to specify G1 before SO is used. Certainly some use escape sequences -- whether they are the registered sequences or not is another matter.) There are a (small) number of devices which accept SI/SO (terminals and printers only, so far -- no one has reported on successful or regular use of SI/SO in data transmission to distant sites). But so far only three or four people have reported anything at all. If we assume that silence implies that one has not heard of any notable use of ISO 2022, then it appears that the vast majority of sites and devices do not use it. Perhaps someone better informed about Bitnet can say whether the Bitnet header can or should or cannot or should not handle a CODEPAGE field. I was always told "Bitnet is EBCDIC" -- maybe we should at least be able to specify what flavor of EBCDIC? Michael Sperberg-McQueen, University of Illinois at Chicago 18-Apr-88 19:51:24-EST,3256;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 18 Apr 88 19:51:19-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 18 Apr 88 19:49:16 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3601; Mon, 18 Apr 88 19:49:14 EDT Received: by BITNIC (Mailer X1.25) id 0347; Mon, 18 Apr 88 19:48:19 EDT Date: Mon, 18 Apr 88 19:43:17 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: punched cards, anyone? 
X-To: iso8859%jhuvm.BITNET@cimsa.nyu.edu To: Frank da Cruz

The following list may cause one to consider the possibility that BITNET should convert over to ISO8859:

   857 JNET      7 RNET      2 POWER     1 NOS
   729 RSCS      6 HUJI-     2 NONE      1 NJI
   133 UREP      6 ANL N     2 MUSIC     1 NJE 4
   124 ALIAS     5 PMDF      2 MTF       1 NJE 1
   117 JES2      5 OASYS     2 MAILE     1 NAM
    82 TCP/I     5 JA JN     2 INTER     1 MRJE/
    52 NJEF      4 IBM R     2 CARLE     1 MEMO
    42 NJE       3 TIELI     2 BERKH     1 MACH2
    42 HOMEB     3 RM        1 UNIX      1 JES2/
    26 ?         3 HUJI      1 TRANS     1 IX/37
    20 ANJE      3 GATE      1 SNA/N     1 HUMAI
    19 BERK      3 ANL/N     1 RTP/1     1 HASPM
    18 BITE      2 TELCO     1 RJEF      1 ETHER
    13 JES3      2 RSCSV     1 RES       1 ECF
    12 HASP      2 RJE S     1 RCOM      1 CDC
     9 MULTI     2 RHF       1 PMDF-     1 ANY
     8 DECNE     2 PRIME     1 NRV       1 AMF

This list is just a count of the different entries in the SYSTEM TYPE field of BITNET LINKS804. (What happens if we add in NETNORTH & EARN?) Just counting JNET & UREP, pretty close to half the nodes are ASCII machines.

(a precise definition of my misuse of terms for those who may otherwise become confused:

   EBCDIC    - the stuff they use on IBM's
   ASCII     - the stuff they use on most everything else
   ISO8859   - the family of codes which will make ISO2022 practical
   ISO8859/1 - ISO8859/1

admittedly not 100% accurate, but, hey, it works for me.)

Finally, let's not forget all the networked hosts, workstations and pc's (IBM included) which hang off BITNET gateways and send mail through them, how many of those do you suppose are EBCDIC?

Perhaps a survey of BITNET hosts should be made. The 2 questions I'd like answers to are:

1) how would you feel about converting all BITNET links to ISO8859?

and for IBM nodes:

2) if IBM were to announce a new code page which was code-point-to-code-point and graphic-to-graphic identical with ISO8859/1 and pledged to keep it that way, would you migrate to it?
20-Apr-88 13:37:24-EST,2093;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 20 Apr 88 13:37:22-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 20 Apr 88 13:37:28 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6126; Wed, 20 Apr 88 13:37:27 EDT Received: by BITNIC (Mailer X1.25) id 5947; Wed, 20 Apr 88 13:36:50 EDT Date: Wed, 20 Apr 88 11:54:00 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Rick Troth Subject: Re: punched cards, anyone? To: Frank da Cruz In-Reply-To: Your message of Mon 18 Apr 88 19:43:17 EST I think the real objective here is to find a (nearly) one-to-one mapping between EBCDIC (North American) and ISO8859/1, with similar 1-1 mappings between other national EBCDIC's and ISO8859/whatever. All those ASCII machines listed in BITNET NAMES are already performing their own translation between ASCII and EBCDIC. To switch the whole network at once would cause many people much grief in both the short-run and the intermediate-run. In the long-run, we would hope that IBM supported products (like RSCS or whatever will someday replace it) will be able to speak ISOxxxx, but remember that that is most likely 21st-century-long-run. JNET (and I suppose others) have their translate tables in place. What we are striving for is a "correct" translate table where I could do a TELL AT Talk ¢¢¢ to me and he would see 3 hex A2's on his VT330 as per DEC Multinational. "Talk cents to me" Kermit transfers would work correctly in both directions (if we achieve 1-1). Translate tables are really not so bad if we can just agree on the translation. Or have I completely missed something?
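A minimal sketch of the one-to-one translate table idea, in Python. It assumes that Python's standard `cp037` codec is an acceptable stand-in for "EBCDIC (North American)" and `latin-1` for ISO8859/1; whether those tables agree with what any particular site or JNET installation uses is not something this sketch can confirm.

```python
# Stand-in encodings (an assumption, see the note above).
EBCDIC = "cp037"
ISO = "latin-1"

# 256-byte translate tables: index = source byte, value = target byte.
TO_ISO = bytes(range(256)).decode(EBCDIC).encode(ISO)
TO_EBCDIC = bytes(range(256)).decode(ISO).encode(EBCDIC)


def is_one_to_one(table):
    """A translate table is reversible only if no target byte repeats."""
    return len(table) == 256 and len(set(table)) == 256


# "Talk <cent><cent><cent> to me", typed on the EBCDIC side.
message = "Talk \u00a2\u00a2\u00a2 to me".encode(EBCDIC)

received = message.translate(TO_ISO)      # what the ISO side gets
returned = received.translate(TO_EBCDIC)  # sent back again

if __name__ == "__main__":
    print(is_one_to_one(TO_ISO), is_one_to_one(TO_EBCDIC))
    print(received.count(b"\xa2"), returned == message)
```

Because both tables are permutations of the 256 byte values, a file can cross the boundary any number of times and come back unchanged, which is the property the Kermit case depends on.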
- Rick
20-Apr-88 15:18:47-EST,5977;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 20 Apr 88 15:18:44-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 20 Apr 88 15:18:49 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6234; Wed, 20 Apr 88 15:18:47 EDT Received: by BITNIC (Mailer X1.25) id 7154; Wed, 20 Apr 88 15:16:49 EDT Date: Wed, 20 Apr 88 21:05:04 GMT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Matthias Melcher <$28%DHDURZ1.BITNET@CUVMA.COLUMBIA.EDU> Subject: PC ASCII To: Frank da Cruz

Dear list subscribers,

the following is a contribution of an IBM code specialist to ECMA about the history and background of Code Page 850 (PC Latin 1). The attachments mentioned are not included, since I don't have them in machine readable form.

Matthias Melcher.

A new character set for the IBM PC, by W.F.Bohn

When I was asked recently to make available to TC1 a copy of the new IBM PC code page I felt that I could not honour that request without a few words of explanation.

The IBM PC was developed to be a very versatile computing device capable also of running video games, teaching programs, etc. That is why the original code table (identified as 437 and Attachment 1 to this contribution) has some unorthodox features:

- the code is based on ASCII - not on EBCDIC,
- the code table positions in column 00 and 01 have graphic characters assigned to them in addition to the normal 7-bit control characters,
- a graphic character was allocated to table position 07/15 in addition to, or as a graphical representation of, the control character DELETE,
- as controls beyond the normal 32 were not envisaged (actually all 256 code table positions could be used for controls or for graphic characters) the right hand half of the code table was divided into
  . three columns with graphic characters believed to satisfy the requirements of the major West European languages,
  . three columns with line and box drawing characters plus other characters believed useful for creating diagrams, company logos, etc.
  . two columns with mathematical and technical symbols.

Later, when the PC was connected to other computing equipment it turned out that its graphic character set did not match any other existing one and that interchange of data between two different IBM machines would have to be limited to the small number of characters common to both installations.

The advent of the 8-bit single-byte coded character set of ECMA-94 made a solution of that dilemma possible. By changing the IBM EBCDIC as well as the PC code to the character set of ECMA-94/1 interchange of all characters without loss of information could be achieved.

An important decision had to be made, however. Which of the characters of table 437 should be sacrificed and which should be taken over into the new code page (identified as 850 and attachment 2 to this contribution)? Furthermore, should the structure of the code table be changed to that of ECMA-94 or should the existing structure be kept?

In the interest of compatibility with existing equipment and existing implementations it was decided:

- to include all the graphic characters of ECMA-94/1,
- to keep the original structure of the code table
- to leave those graphic characters in table 437 and now also in table 850 in their original code table positions (there are two exceptions which need not be explained here),
- to select for the 32 positions in the right hand half of the code table (not needed for characters from ECMA-94/1) a useful set of the line drawing and other characters of table 437 and keep those characters also in their original positions. These characters selected are
  . the 11 basic line drawing characters in thin (or single line) rendition,
  . the same 11 characters in bold (or double line) rendition,
  . three shading characters (light, medium, heavy),
  . three block characters (full box, upper half, lower half),
  . a small solid square for different uses.

To the remaining three positions were assigned graphic characters formerly in use in IBM but removed when IBM equipment changed their graphic character sets to that of ECMA-94/1. Backward compatibility with the existing equipment was the reason for this decision.

For interchange of the common character set between EBCDIC and PC oriented equipment new translation correspondences were determined which, when the traditional translation correspondence between EBCDIC and ISO (or ASCII) code equipment is used, would lead to an orderly arrangement of the additional characters in columns 08 and 09 of ECMA-94. The arrangement is as similar as possible to the one proposed for ISO 6937-6. This may be of some importance when one day the code extension procedures of ECMA-35 will allow additional G sets of 128 characters instead of only 94 or 96.

Information on the translation correspondence between EBCDIC, PC-ASCII, and ECMA (or ISO) oriented coding is appended to this contribution as Attachment 3. Also attached is a copy of the international version of EBCDIC (identified as 500 and Attachment 4 to this contribution). The graphic character set common to ECMA 94/1, the IBM PC and IBM EBCDIC (identified as 697-1) is Attachment 5.

Conclusion: The new IBM PC code table (850) is the best possible compromise between the desire to implement the graphic character set of ECMA-94/1 and the need to create a minimum impact on the existing implementations. Of course, the IBM PC implementation of ECMA-94/2 follows the principles outlined above.
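The design decisions listed above can be checked against present-day code tables. The sketch below uses Python's standard `cp437` and `cp850` codecs, on the assumption that they reflect the tables the contribution describes (the attachments themselves are not available here), and takes the right-hand half of ISO 8859-1 (A0 to FF) as the graphic characters of ECMA-94/1.

```python
# Right-hand half of ISO 8859-1, taken here as the ECMA-94/1 graphics.
LATIN1_GRAPHICS = [chr(c) for c in range(0xA0, 0x100)]

# Right-hand halves (80..FF) of the two PC code pages, as Python knows them.
upper_437 = bytes(range(0x80, 0x100)).decode("cp437")
upper_850 = bytes(range(0x80, 0x100)).decode("cp850")

# Latin-1 graphics that each code page cannot represent.
missing_from_437 = [ch for ch in LATIN1_GRAPHICS if ch not in upper_437]
missing_from_850 = [ch for ch in LATIN1_GRAPHICS if ch not in upper_850]

# Characters in the upper half of 850 that are not part of Latin-1
# (line drawing, shading, blocks and the few remaining positions).
extras_850 = [ch for ch in upper_850 if ord(ch) > 0xFF]

# Code positions whose character is the same in both tables.
same_position = [
    0x80 + i for i, (a, b) in enumerate(zip(upper_437, upper_850)) if a == b
]

if __name__ == "__main__":
    print("Latin-1 graphics missing from 437:", len(missing_from_437))
    print("Latin-1 graphics missing from 850:", len(missing_from_850))
    print("non-Latin-1 characters kept in 850:", len(extras_850))
    print("positions unchanged between 437 and 850:", len(same_position))
```

The counts for 850 should line up with the arithmetic in the text: 96 positions for the Latin-1 graphics plus 32 others fill the 128 positions of the right-hand half.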
20-Apr-88 19:55:59-EST,2743;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 20 Apr 88 19:55:55-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 20 Apr 88 19:55:40 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6598; Wed, 20 Apr 88 19:55:38 EDT Received: by BITNIC (Mailer X1.25) id 0871; Wed, 20 Apr 88 19:54:35 EDT Date: Wed, 20 Apr 88 16:49:09 PDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: June Genis Subject: BITNET's current mapping To: Frank da Cruz REPLY TO 04/20/88 16:33 FROM ISO8859@JHUVM.BITNET "ASCII/EBCDIC character set related iss: BITNET's current mapping >What is BITNET's current ASCII-EBCDIC standard? >and where may one obtain a copy? >Thanks in advance. Sorry, John. No such thing currently exists which is one reason why data sets are trashed along the way. Since BITNET is defined to be an EBCDIC system, in theory any ASCII node should be translating to/from EBCDIC for anything originating/terminating at that node. No standard has ever been defined as far as I know for what translation should be used. An even more ambiguous situation potentially exists when an ASCII node is an intermediary node. Can an ASCII node be anything other than an end node? If so, are files passing thru it translated twice or not at all? In the latter case, might there be a chance that the translations are not reversible such that the EBCDIC file emerging on the other side is different than that which entered? While it's clear to me that the absence of a standardized translate table could result in things being messed up when the communication is between an ASCII and an EBCDIC node, it is not clear to me if the intervening nodes which just happen to be along the path can have an impact as well. 
This strikes me as the worst problem since finding out which node trashed the file could be a real bear. It's not even clear to me if the possibility of implementing a standardized translate table even exists as many nodes have local variations in translation that they are committed to for one reason or another. Can we assume that all systems have the ability to apply one translation to their network mail and another in other situations (say an ascii terminal attached to the host which is used to generate mail shipped both locally and to the net)? /June To: ISO8859@JHUVM.BITNET 21-Apr-88 09:35:00-EST,1818;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 21 Apr 88 09:34:39-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 21 Apr 88 09:34:13 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7058; Thu, 21 Apr 88 09:34:12 EDT Received: by BITNIC (Mailer X1.25) id 4432; Thu, 21 Apr 88 09:33:36 EDT Date: Thu, 21 Apr 88 02:03:36 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edward_Vielmetti@um.cc.umich.edu Subject: ISO Latin-1 terminals X-To: ISO8859%JHUVM.BITNET@CUNYVM.CUNY.EDU To: Frank da Cruz There's an ISO Latin-1 font available for the Apollo workstations, which Jim Rees (umix!apollo!rees) pointed out in a recent usenet posting. Conceptually, it's real easy for any bitmapped terminal with a replaceable character set to make up a font like that; the difficulties arise when the data transport paths are not 8-bit transparent (in the all-ASCII world) or when goofy EBCDIC machines get in the way. A Latin-1 font for the Apple Macintosh would be easy to construct, but there's the underlying problem that the typical Mac font has its own Apple arrangement for the upper set of characters.
I think you could still get everything to print out OK with a suitable manipulation of Postscript. Edward Vielmetti, U of Michigan Computing Center USERW02S@UMICHUM emv@umix.cc.umich.edu (If you can think of any reason to send Bitnet mail to the UMICHUM address, please do so - it's a new service which might fail.) 21-Apr-88 22:14:47-EST,4697;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 21 Apr 88 22:14:40-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 21 Apr 88 22:14:19 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8217; Thu, 21 Apr 88 22:14:18 EDT Received: by BITNIC (Mailer X1.25) id 5752; Thu, 21 Apr 88 22:12:17 EDT Date: Mon, 18 Apr 88 17:57:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: IBM4250 etc. To: Frank da Cruz Dear list subscribers I think I was too optimistic about EBCDIC in my tutorial. Since then I have been collecting code pages, and it seems now that there is not only a separate code page for every language, but for every piece of printing (or imaging) hardware as well. This results in some kind of Cartesian product, at least two parameters are required for describing an item. Some of these code pages are in ISO/ANSI style, some mirrored. My catalogue is not yet complete, please provide me with the missing data (IBM numbers in particular). "Madamina, il catalogo i questo": IBM Corporate System Standard, CSS 3-3220-2 (Latest version, please) IBM Technical Reference for Digitized Type G544-3516 (not avlbl. 
here)
IBM VS FORTRAN Language and Library Reference SC26-4119-1
IBM Displaywriter Host Attach Programming Guide
IBM DCA RFT Reference
IBM 3270 Information Display System, Character set Reference GA-27-2837-9
IBM GDDM (which one, the set here is not complete)
IBM 3800 Printing Subsystem Model 3 Font Catalog SH35-0053 (id.)
IBM 3800 (another programming guide)
IBM 4250 Printer

The code pages of the 4250 are particularly nice. I selected the codes for six important characters: left square bracket, backslash, right square bracket, left brace, vertical bar, right brace (in that order): (the square bracket are here AD and BD, I can type them at my 3278 only in hexadecimal.)

IBM 4250 Code Pages  AFTC   [   \   ]   {   |   }
German               0382   63  EC  FC  43  CC  DC
Belgium              0383   4A  48  5A  51  DD  54
Brazil               0384   71  E0  68  CF  48  51
Canada F             0385   44  5A  79  51  DD  54
Denm./Norway         0386   9E  E0  5A  9C  70  47
Sweden/Finl.         0387   B5  71  5A  43  CC  47
France               0388   90  48  B5  51  DD  54
Italy                0389   90  48  51  44  CD  54
Japan                0390   B1  B2  6A  C0  4F  D0
Latin Am.(Sp)        0393   4A  E0  5A  C0  4F  D0
Portugal             0391   4A  68  5A  46  CF  D0
Spain                0392   4A  E0  5A  C0  4F  D0
UK, Aus. NZ.         0394   B1  E0  6A  C0  4F  D0
US, Canada E         0395   B0  E0  6A  C0  4F  D0
International        0361   4A  E0  5A  C0  6A  D0
APL                  0293   AD  B7  BD  BF

"Ma in Espagna che glie mille e tre."

I suppose that the ordinary national code pages are only a little bit less confusing.

Mr. Stolz's letter arrived here in good order, only his German jokes missed their point, because our STC/Siemens laser printer closely follows the GT12 convention from the IBM 3800 software, and prints mostly spaces for accented letters, and a vertical bar for the exclamation sign (the same for Mr. Klensin's broken bar). I have not seen IBM's latest invention according to Mr. Stolz as yet, but I remain sceptical. I think 9 tables are still too many. One single universal code page is what is wanted. Before it is shown that really not all characters can be accommodated I remain unconvinced.
This code page can be used for BITNET interchange of text. Locally it can be converted to the historical version. There are more things in Mr. Stolz's letter that deserve comment, but that will come later. Just now I propose to you a little experiment. Suppose someone in Norway is typing a text at a PC. Of course it contains many AE, ae, A-ring and a-ring, barred O and o, and so on. Now he transfers the text by Kermit to an IBM system, by EARN to a VAX system, and by Kermit to the PC again. This involves conversion from ASCII to EBCDIC and back. Now he does the same, but first to a VAX, then again to an IBM, and again to the PC. The difference between the two he cannot explain, but we can. It is sufficient to take the sequence: AE O/ A0 ae o/ a0 (I suppose it is clear what I mean. German users may take A O U a o u with umlauts.) FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands : 21-Apr-88 22:47:24-EST,6910;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 21 Apr 88 22:47:17-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 21 Apr 88 22:47:08 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8245; Thu, 21 Apr 88 22:47:06 EDT Received: by BITNIC (Mailer X1.25) id 6021; Thu, 21 Apr 88 22:45:05 EDT Date: Tue, 19 Apr 88 14:43:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: more comments on O. Stolz To: Frank da Cruz Dear list subscribers One should never quote texts from memory. The correct version of Leporello's aria from Don Giovanni is: Madamina! Il catalogo [ questo delle belle ch'amh il padron mio; un catalogo egli [, ch'ho fatto io; osservate, leggete con me! 
In Italia sei cento e quaranta; in Alemagna cento e trent'una; cento in Francia, in Turchia novant'una; ma in Ispagna son gik mille e tre! This text in Italian contains some accented letters. The question is how to transmit this by BITNET correctly. I put carefully (with ISPF/PDF CHANGE) the corresponding CP500 codes at the right places, hoping that you could at least print it. But that may not work everywhere. I also could put in the ISO6937-2 designations (also used by IBM), preceded by a & for identification. But this is hard to read. The other way out is to devise a notation, that only uses the 94 character set, and is suitable for conversion by a little program to the extended local printer set. This is what we use here. As long as there is no universal code we should agree on temporary solutions. What is your proposal for transmitting texts? Madamina! Il catalogo &LE13 questo delle belle ch'am&LO13 il padron mio; un catalogo egli &LE13, ch'ho fatto io; osservate, leggete con me! In Italia sei cento e quaranta; in Alemagna cento e trent'una; cento in Francia, in Turchia novant'una; ma in Ispagna son gi&LA13 mille e tre! Madamina! Il catalogo \e questo delle belle ch'am\o il padron mio; un catalogo egli \e, ch'ho fatto io; osservate, leggete con me! In Italia sei cento e quaranta; in Alemagna cento e trent'una; cento in Francia, in Turchia novant'una; ma in Ispagna son gi\a mille e tre! The following comments pertain to Mr. Otto Stolz's letter: >Moreover, IBM has defined 9 (nine) Country Extented Code Pages (serving >17 languages), which contain the characters of GCS 00697 in various per- >mutations. Again, you should make sure that you use the new CECPs. What is the relation between these and CP500, and what is the source document? >Clearly, the next step to be required from IBM must be adapting their >language processors to the CECPs.
Recognizing dual EBCDIC codes for >some characters, is not enough for the compilers and other applications: >as long as there are various EBCDICs (call them CECPs or what you want), >you must be able to customize them for the variant to be used! Folks, >please help convincing IBM by sending in as many APARs as you have products. >The same holds for other software suppliers. This is a most unfortunate idea. If people want to adapt compilers to their own local or national conventions, it is at their own risk. >But now, for the difference between plain text and programs. In addition >............................................ >Now, you probably understand IBM's reluctance to a single universal CECP. This passage is completely incomprehensible to me, because nothing is printed here as was presumably intended. >BITNET is primarily designed for transferring messages, i.e. plain >text. Let's set a comparatively humble goal, for the moment: > BITNET should transmit any plain text consisting of characters > from the ISO 8859-1 character set (i.e. GCS 00697) sensibly > and undisturbed. BITNET does not transmit characters, but bytes. When a text is received, it has to be interpreted before it can be shown on a screen or printed. This requires a key, as with every coded message. For BITNET there is a default, 94-character EBCDIC, not ASCII or ISO8859-1. Any deviations are "subject to mutual agreement between the interchange parties", as ISO usually formulates it. This does not mean that use of the other bytes is forbidden, only that there is no fixed interpretation for them. As there is no agreement between Mr. Stolz and me about the meaning of the codes he uses for German letters, I cannot read his German texts. >The price for sending out the notes in a local code variant (well, that's >the very procedure, most sites are following right now) will be the >obligation of translating incoming messages. So, every site will use at >most 9 (of the possible 90) translation tables.
Again, this could be >done via SET INPUT and SET OUTPUT, as MM suggested (that's exactly the >way, I read notes from USA and elsewhere). Later, the mailer, or RSCS, >or some similar software piece, would do the translating for all incoming >files, and the end-user will cease bothering with the details. Again, this is a misconception. We want to extend the default set of 94 characters with a commonly agreed code table. Just as people who cross their national border have to speak a different language, we should change our computer-using habits when using BITNET. Just as we are used to speaking English internationally, we should agree upon one single code system, be it computer-English or computer-Esperanto. And if a single code page is technically not possible, we should tell the recipient beforehand which internationally accepted alternative he is going to receive. We do not even need IBM for doing this. We only have to provide a local facility to produce the right codes from a local text, and a printer with all the necessary characters. All the translate tables required we can build ourselves. (The 3800 software, even the non-APA, allows you to construct your own character tables.) >for a feasible common base of our half-tables. But, what about IBM's >character identifiers, accompanying the GCS and CECP tables? Instead of >"small letter a", we could use "LA01"; instead of "small letter a with >grave accent" we say "LA13", and instead of "small diphthong a with e", >we have "LA51". Those identifiers are not IBM's, but those found in ISO6937-2, with a few extensions, such as for the Dutch guilder. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O.
Box 486, 2300AL Leiden, Netherlands : 22-Apr-88 09:52:59-EST,3037;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 22 Apr 88 09:52:54-EST Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 22 Apr 88 09:52:45 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8564; Fri, 22 Apr 88 09:52:44 EDT Received: by BITNIC (Mailer X1.25) id 9098; Fri, 22 Apr 88 09:51:57 EDT Date: Fri, 22 Apr 88 13:19:48 +0200 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre PIRARD Subject: Re: BITNET's current mapping To: Frank da Cruz In-Reply-To: Message of Wed, 20 Apr 88 19:09:20 EST from >What is BITNET's current ASCII-EBCDIC standard? >and where may one obtain a copy? >Thanks in advance. ASCII/EBCDIC translation on BITNET occurs in gateways connecting it to ASCII networks. For long, WISCVM drained a lot of the public traffic (from which it died). Successors probably have inherited the same tables. ASCII end nodes deal with their own requirements. I have found WISCVM tables matching standard CMS KERMIT tables, dumped below, as well as many other products and consequent ASCII data stored on servers. Probably many other gateways use the same. These tables have been tested with UUENCODED data, involving the 95 characters, that's all but control codes and DEL. Focus on A->E, E->A has a couple of non-revertible additions. Sorry for a straight dump only, this has to be a quick answer. 
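The round-trip claim for the 95 printable characters can be checked mechanically against the TDUMP A and TDUMP E listings in this message. The following Python sketch is not part of the original posting. The hex strings are transcribed from the dumps, the names (`parse_dump`, `A2E`, `E2A`, `one_way`) are made up for illustration, and a zero entry in the E->A table is assumed to mean "no mapping".

```python
# Low half (ASCII 00-7F) of TDUMP A; the dump repeats the same 128
# entries for codes 80-FF, so the full table is this half doubled.
A2E_LOW_HEX = """
00010203 372D2E2F 1605250B 0C0D0E0F
10111213 3C3D3226 18193F27 1C1D1E1F
405A7F7B 5B6C507D 4D5D5C4E 6B604B61
F0F1F2F3 F4F5F6F7 F8F97A5E 4C7E6E6F
7CC1C2C3 C4C5C6C7 C8C9D1D2 D3D4D5D6
D7D8D9E2 E3E4E5E6 E7E8E9AD E0BD5F6D
79818283 84858687 88899192 93949596
979899A2 A3A4A5A6 A7A8A9C0 4FD0A107
"""

# TDUMP E, all 256 entries (EBCDIC code -> ASCII code).
E2A_HEX = """
00010203 0009007F 0000000B 0C0D0E0F
10111213 00000800 18190000 1C1D1E1F
00000000 000A171B 00000000 00050607
00001600 00000004 00000000 1415001A
20000000 00000000 00005C2E 3C282B7C
26000000 00000000 00002124 2A293B5E
2D2F0000 00000000 00007C2C 255F3E3F
00000000 00000000 00603A23 40273D22
00616263 64656667 6869007B 00000000
006A6B6C 6D6E6F70 7172007D 00000000
007E7374 75767778 797A0000 005B0000
00000000 00000000 00000000 005D0000
7B414243 44454647 48490000 00000000
7D4A4B4C 4D4E4F50 51520000 00000000
5C005354 55565758 595A0000 00000000
30313233 34353637 38397C00 00000000
"""


def parse_dump(text: str) -> bytes:
    """Turn a whitespace-separated hex dump into a lookup table."""
    return bytes.fromhex("".join(text.split()))


A2E = parse_dump(A2E_LOW_HEX) * 2   # 256 entries
E2A = parse_dump(E2A_HEX)           # 256 entries

# The 95 characters: everything except control codes and DEL.
printable = range(0x20, 0x7F)
round_trip_ok = all(E2A[A2E[a]] == a for a in printable)

# EBCDIC codes whose ASCII image does not translate back to them
# (zero entries are treated as unmapped, which is an assumption).
one_way = sorted(e for e in range(1, 256)
                 if E2A[e] != 0 and A2E[E2A[e]] != e)

print("A->E->A round trip for printable ASCII:", round_trip_ok)
print("one-way E->A codes:", [f"{e:02X}" for e in one_way])
```

With the tables as dumped, the check passes for all 95 characters, and the one-way entries it reports are the E->A additions mentioned above.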
Andr)

TDUMP A (ASCII -> EBCDIC)
00010203 372D2E2F 1605250B 0C0D0E0F
10111213 3C3D3226 18193F27 1C1D1E1F
405A7F7B 5B6C507D 4D5D5C4E 6B604B61
F0F1F2F3 F4F5F6F7 F8F97A5E 4C7E6E6F
7CC1C2C3 C4C5C6C7 C8C9D1D2 D3D4D5D6
D7D8D9E2 E3E4E5E6 E7E8E9AD E0BD5F6D
79818283 84858687 88899192 93949596
979899A2 A3A4A5A6 A7A8A9C0 4FD0A107
00010203 372D2E2F 1605250B 0C0D0E0F
10111213 3C3D3226 18193F27 1C1D1E1F
405A7F7B 5B6C507D 4D5D5C4E 6B604B61
F0F1F2F3 F4F5F6F7 F8F97A5E 4C7E6E6F
7CC1C2C3 C4C5C6C7 C8C9D1D2 D3D4D5D6
D7D8D9E2 E3E4E5E6 E7E8E9AD E0BD5F6D
79818283 84858687 88899192 93949596
979899A2 A3A4A5A6 A7A8A9C0 4FD0A107

TDUMP E (EBCDIC -> ASCII)
00010203 0009007F 0000000B 0C0D0E0F
10111213 00000800 18190000 1C1D1E1F
00000000 000A171B 00000000 00050607
00001600 00000004 00000000 1415001A
20000000 00000000 00005C2E 3C282B7C
26000000 00000000 00002124 2A293B5E
2D2F0000 00000000 00007C2C 255F3E3F
00000000 00000000 00603A23 40273D22
00616263 64656667 6869007B 00000000
006A6B6C 6D6E6F70 7172007D 00000000
007E7374 75767778 797A0000 005B0000
00000000 00000000 00000000 005D0000
7B414243 44454647 48490000 00000000
7D4A4B4C 4D4E4F50 51520000 00000000
5C005354 55565758 595A0000 00000000
30313233 34353637 38397C00 00000000

23-Apr-88 22:37:33-EST,1548;000000000001 Return-Path: Received: from rutgers.edu by CU20B.COLUMBIA.EDU with TCP; Sat 23 Apr 88 22:37:28-EST Received: by rutgers.edu (5.54/1.15) id AA25258; Sat, 23 Apr 88 19:15:30 EDT Received: by ucbvax.Berkeley.EDU (5.59/1.28) id AA13912; Fri, 22 Apr 88 22:27:13 PDT Received: from USENET by ucbvax.Berkeley.EDU with netnews for protocols@rutgers.edu (protocols@rutgers.edu) (contact usenet@ucbvax.Berkeley.EDU if you have questions) Date: 22 Apr 88 20:08:03 GMT From: mnetor!utzoo!utgpu!water!watmath!egisin@uunet.uu.net (Eric Gisin) Organization: U of Waterloo, Ontario Subject: Re: UUCP over X25 on Sun 3 Message-Id: <18471@watmath.waterloo.edu> References: <287@tauros.UUCP>, <19772@pyramid.pyramid.com>, <20060@pyramid.pyramid.com> Sender:
protocols-request@rutgers.edu To: protocols@rutgers.edu In article <20060@pyramid.pyramid.com>, csg@pyramid.pyramid.com (Carl S. Gutekunst) writes: > [...] > The 7-bit-printable-ASCII restriction comes from international X.25 gateways, > many of which insist on swiping the eighth bit for parity or somesuch. A few > also do funny mappings of control characters, like munging tabs. If I set up a > raw X.25 virtual circuit between here and West Germany, it will be 7 bits and > there is nothing I can do about it. It's difficult to believe CCITT is so stupid as to allow this in X.25 VCs. Maybe I'll have one last look at the red book to verify it. What happens if one wants to run IP, DECNET, or OSI across such a gateway? I guess you don't. 26-Apr-88 12:48:20-EDT,1440;000000000001 Mail-From: SY.FDC created at 26-Apr-88 12:48:17 Date: Tue 26 Apr 88 12:48:17-EDT From: Frank da Cruz Subject: MacKermit modifications To: placeway@TUT.CIS.OHIO-STATE.EDU cc: sy.christine@CU20B.COLUMBIA.EDU Message-ID: <12393565441.48.SY.FDC@CU20B.COLUMBIA.EDU> Paul, here are more people champing at the bit... Sounds like they're working on a pretty old version, but the ISO8859 stuff is a big plus for the Europeans. Any news? Haven't heard from you for a while, and we're starting to get a little anxious... - Frank --------------- To: hafro!comp-protocols-kermit@uunet.UU.NET Path: krafla!frisk From: mcvax!rhi.hi.is!frisk@uunet.UU.NET (Fridrik Skulason) Newsgroups: comp.protocols.kermit Subject: MacKermit modifications Date: 25 Apr 88 17:27:12 GMT Organization: University of Iceland (RHI) Here at the University we have made a few modifications to MacKermit:
* #ifdef..#endif for MPW
* Full 8 bit (ISO 8859/1) Terminal emulation support with automatic character set conversion.
Is anyone else working on similar changes? If not, do you think someone would be interested in receiving our modifications?
Fridrik Skulason University of Iceland UUCP frisk@rhi.uucp BIX frisk ------- 29-Apr-88 11:41:39-EDT,2968;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 29 Apr 88 11:41:38-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 29 Apr 88 11:38:13 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6897; Fri, 29 Apr 88 11:37:55 EDT Received: by BITNIC (Mailer X1.25) id 5773; Fri, 29 Apr 88 11:37:17 EDT Date: Thu, 28 Apr 88 22:10:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: corrections To: Frank da Cruz Dear list subscribers Here are a few corrections and additions to my previous letters. >printer set. This is what we use here. As long there is universal code >we should agree on temporary solutions. What is your proposal for >transmitting texts? printer set. This is what we use here. As long there is NO universal code we should agree on temporary solutions. What is your proposal for transmitting texts? The translation of Leporello's aria is:

Madamina!                               Dear Miss!
Il catalogo \e questo                   Here is the catalog
delle belle ch'am\o il padron mio;      of the beauties my master courted;
un catalogo egli \e, ch'ho fatto io;    it is a catalog that I made myself;
osservate, leggete con me!              look, read with me!
In Italia sei cento e quaranta;         In Italy 640,
in Alemagna cento e trent'una;          in Germany 131,
cento in Francia,                       100 in France,
in Turchia novant'una;                  in Turkey 91,
ma in Ispagna son gi\a mille e tre!     but in Spain there are already 1003!

I hope that the number of entries in IBM's Code Page Catalog is considerably smaller. In the following table $o and $O should be moved one column to the right REPRESENTATION OF LETTERS FROM ISO 8859-1 WITH CP500 TABLE 4. 5. 6. 7. 8. 9. A. B. C. D. E. F.
.0 $o $O .1 /e /E a j A J 1 .2 ^a ^e ^A ^E b k s B K S 2 .3 %a %e %A %E c l t C L T 3 .4 \a \e \A \E d m u D M U 4 .5 /a /i /A /I e n v E N V 5 .6 ~a ^i ~A ^I f o w F O W 6 .7 @a %i @A %I g p x G P X 7 .8 $c \i $C \I h q y H Q Y 8 .9 ~n &s ~N i r z I R Z 9 .A .B ^o ^u ^O ^U .C $d &a $D %o %u %O %U .D /y /Y \o \u \O \U .E $p &A $P /o /u /O /U .F ~o %y ~O Best regards, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands : 10-May-88 08:17:46-EDT,3091;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 10 May 88 08:17:44-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 10 May 88 08:15:05 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7650; Tue, 10 May 88 08:15:04 EDT Received: by BITNIC (Mailer X1.25) id 7519; Tue, 10 May 88 08:16:11 EDT Date: Tue, 10 May 88 13:08:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: EBCDIC-1992 To: Frank da Cruz Dear list subscribers After having read again the whole of the discussion I think a solution can be proposed that is realizible without asking IBM or our managements for any important changes. 1. We take ISO8859-1 as is, other parts are for later consideration. 2. We agree on a single, universal and international version (code page) of EBCDIC, that contains all the characters from ISO8859-1. This we call EBCDIC-1992. 3. We agree on a single translate table between ISO8859-1 and EBCDIC-1992. (It is clear that the code page determines the translate table, or vice versa. Which is fixed first does not really matter, it depends on what is the more difficult: changing the existing translate tables, or the code pages.) These points I call the 1992-convention. (The details are subject to further discussion.) 
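Point 3 of the convention, a single translate table between ISO8859-1 and the agreed EBCDIC page, can be illustrated with a short Python sketch. This is not the author's program. EBCDIC-1992 is only a proposal in this letter, so Python's `cp500` codec is used purely as an assumed stand-in for it, and the function names are hypothetical.

```python
# Assumption: the agreed page is stood in for by Python's "cp500"
# codec. EBCDIC-1992 itself is only a proposal in this thread.
CODE_PAGE = "cp500"

# 256-entry tables: index is the source byte, value the target byte.
ALL_BYTES = bytes(range(256))
ISO_TO_EBCDIC = ALL_BYTES.decode("latin-1").encode(CODE_PAGE)
EBCDIC_TO_ISO = ALL_BYTES.decode(CODE_PAGE).encode("latin-1")


def to_ebcdic(data: bytes) -> bytes:
    """Translate ISO 8859-1 bytes to the agreed EBCDIC page."""
    return data.translate(ISO_TO_EBCDIC)


def to_iso(data: bytes) -> bytes:
    """Translate bytes in the agreed EBCDIC page back to ISO 8859-1."""
    return data.translate(EBCDIC_TO_ISO)


# The code page fixes the table and vice versa only if the mapping is
# one-to-one, i.e. every target byte is used exactly once.
is_one_to_one = len(set(ISO_TO_EBCDIC)) == 256

sample = "ch'amò il padron mio".encode("latin-1")
print(is_one_to_one, to_iso(to_ebcdic(sample)) == sample)
```

Because each table is just 256 bytes, either one can be derived from the other, which is the sense in which it does not matter whether the code page or the translate table is fixed first.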
People adhering to this convention will send and receive their EARN/BITNET files using EBCDIC-1992. At an IBM installation this requires:
a. a program that is able to translate files from EBCDIC-1992 to local EBCDIC and back (such programs I write in SNOBOL straight away).
b. a printing facility providing all the 190 characters from ISO8859-1.
c. a typing convention enabling the user to type transliterations of the 190 characters, using only 47 keys on a normal keyboard.
d. a program converting the transliterations (see c.) to local EBCDIC or EBCDIC-1992.
Except for the printer, this scheme does not require any new hardware on the IBM side. (I know too little about VAX to tell what has to be done there. Anyway, the conversion table must be installed.) One advantage of choosing a version of EBCDIC for text interchange is that texts appearing on the screen are readable to a large extent, because EBCDIC-1992 does not change the simple Latin letters, the digits and many specials. It may be that after much negotiation with IBM a better solution will turn up, but that could take years. We need something now. I am awaiting your reaction. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O.
Box 486, 2300AL Leiden, Netherlands : 17-May-88 17:05:13-EDT,4311;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 17 May 88 17:05:09-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 17 May 88 17:02:29 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7118; Tue, 17 May 88 17:02:27 EDT Received: by BITNIC (Mailer X1.25) id 0255; Tue, 17 May 88 16:41:08 EDT Date: Tue, 17 May 88 11:31:27 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Extended character sets, translation table implemented To: Frank da Cruz Remembering Howard Gilbert's warning "A curse upon anyone who actually puts them into production before the community as a whole agrees to them," I have been careful not to put them into production yet, but we have now successfully installed for testing the translate tables for ISO 8859/1 and the U.S. Country Extended Code Page which were discussed here. The translations for graphics are those proposed by Howard Gilbert in the message posted on 15 march 1988 (original date 3 September 1987) as modified in the subsequent discussion and summarized in Alain Fontaine's posting of 25 March 1987. Control characters, including DUP and FM, are handled as in the standard Yale ASCII / IBM 7171 tables (except that CR, NL, EM, and FF are *not* given their own code points in the protocol converter: any device that needs the special handling provided by Yale ASCII will have to use the vanilla tables). So far, so good. I had a little trouble getting the 7171 to use the correct (modified) tables, and of course they take a Kb or so of precious RAM, but they seem to work correctly. When I load the upper half of ISO 8859/1 into my IBM3163 and display a hex chart, it looks like the U.S. 
Country Extended Code Page distributed last August at SHARE. (Will someone who knows please tell those of us who can't get the documents whether the US CECP is Code Page 500 or Code Page 037, or what? If you want, I'll send you a list of code points and their graphics, but somebody *please* tell me what code page we've implemented!) Results: one problem so far. Because ISO, against its own rules (ISO 2022), put a graphic in position FF, which cannot be mapped into a seven-bit data stream, that graphic (y-with-diaeresis, EBCDIC DF in the US CECP) does not show up on my screen. Questions for the group: 1 should EBCDIC DF be returned as an illegal graphic, or what? (I'm letting it be, even though it doesn't display. So far it doesn't seem to be blowing anything up.) 2 should any compromises be made with the 94-character ASCII-to-EBCDIC translations users have become accustomed to? That is, should the 188-character translate tables allow users to type ASCII carat and get EBCDIC logical not, or should they insist on the new ASCII logical not? In other words, should the 188-character tables be a superset of the 94-character tables, or not? (I thought about this, at the last minute, and then decided I didn't want to have to explain, ten years from now, that we had a chance for a 1-to-1 ASCII-EBCDIC conversion and passed it up because users were used to typing '5' and getting '^'. That's "typing '&carat.' and getting '&logicalnot.'," for those who aren't looking at the same kind of screen I am.Y And the brackets I see are BA and BB -- apologies to those with TN print trains.Y So I implemented the table as it was discussed here, no compromises with the old compromise solutions. Before we make this public here, I expect to write a couple execs to set and clear various input and output mappings for CMS and Xedit, so people can simulate the 94-character translate tables in CMS and Xedit, when they don't need the whole thing.) 
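One way to approach the "which code page have we implemented?" question is to look only at the code points where the two candidates disagree. The Python sketch below is an illustration, not part of the original message. It assumes that Python's `cp037` and `cp500` codecs correspond to the code pages discussed here, and the helper name `differing_points` is made up.

```python
def differing_points(page_a: str, page_b: str) -> dict:
    """Map each code point to its two graphics wherever the pages differ."""
    diff = {}
    for cp in range(256):
        a = bytes([cp]).decode(page_a)
        b = bytes([cp]).decode(page_b)
        if a != b:
            diff[cp] = (a, b)
    return diff


# Assumption: Python's codecs stand in for Code Page 037 and 500.
diff = differing_points("cp037", "cp500")
for cp, (in_037, in_500) in sorted(diff.items()):
    print(f"{cp:02X}: cp037 {in_037!r}  cp500 {in_500!r}")

# Graphics reported in the message, looked up in the cp037 codec.
print("brackets in cp037:", "[]".encode("cp037").hex(" ").upper())
print("y-diaeresis in cp037:", "ÿ".encode("cp037").hex().upper())
```

Comparing a hex chart from the terminal against the handful of code points printed here should be enough to tell the two pages apart, since the letters, digits and accented characters sit at the same positions in both.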
Anyone interested in copies of the code (host-based Series/1 and 7171 macros) can have them if you send me your name and Bitnet address. Michael Sperberg-McQueen, University of Illinois at Chicago 18-May-88 06:44:16-EDT,4044;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 18 May 88 06:44:12-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 18 May 88 06:41:33 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7674; Wed, 18 May 88 06:41:31 EDT Received: by BITNIC (Mailer X1.25) id 6538; Wed, 18 May 88 06:42:30 EDT Date: Wed, 18 May 88 11:15:18 +0200 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues Comments: Resent-From: Andre PIRARD Comments: Originally-From: Andre PIRARD From: Andre PIRARD Subject: Re: Extended character sets, translation table implemented To: Frank da Cruz In-Reply-To: Your message of Tue, 17 May 88 11:31:27 CDT >precious RAM, but they seem to work correctly. When I load the upper >half of ISO 8859/1 into my IBM3163 and display a hex chart, it looks >like the U.S. Country Extended Code Page distributed last August at >SHARE. (Will someone who knows please tell those of us who can't get >the documents whether the US CECP is Code Page 500 or Code Page 037, or >what? If you want, I'll send you a list of code points and their >graphics, but somebody *please* tell me what code page we've >implemented!) In theory, your US CECP is code page 037. I may have a quick check if you like. But 037 has exclamation mark at 5A. 500 has closed bracket there. To my dismay, I am bound to implement cp 500, but I'll make it comment switchable between cp 037 and cp 500. >Results: one problem so far. 
Because ISO, against its own rules (ISO >2022), put a graphic in position FF, which cannot be mapped into a >seven-bit data stream, that graphic (y-with-diaeresis, EBCDIC DF in the >US CECP) does not show up on my screen. You're right! Johan will probably wake up on this one. I'll forward the question to someone of our brains not on the list. >1 should EBCDIC DF be returned as an illegal graphic, or what? (I'm >letting it be, even though it doesn't display. So far it doesn't seem >to be blowing anything up.) I think you relate to DF as being the RATS code to which invalid EBCDIC codes are translated. It is finally output to the terminal as whatever the terminal table contains at offset DF. So, in my mind, DF can be replaced by anything else, unless there is some hard coded test for that value somewhere in the code, but I see no reason why. In fact, this is really the heart of the problem. Which codes in the RATS or terminal table has a hard coded value? It seems the two-level translation allows for choosing any value in the RATS, which is the 7171 own internal hidden business. So, choosing it to be ISO8859 is a matter of convenience. A couple of deviations here and there won't hurt if clearly documented. >2 should any compromises be made with the 94-character ASCII-to-EBCDIC >translations users have become accustomed to? That is, should the >188-character translate tables allow users to type ASCII carat and get >EBCDIC logical not, or should they insist on the new ASCII logical not? >In other words, should the 188-character tables be a superset of the >94-character tables, or not? I hate it. You have to switch to APL mode to reach ISO2022 haven't you? I prefer to implement two modes: 1) non-APL for older terminals, for which installation dependent compromises are acceptable for convenience. 2) APL mode switched to by intelligent terminals (mostly micros) which have their elaborate keyboard redefinition anyway. 
Else, one will one day find true ISO hardware and start the story all over again. A standard is a standard. And to be a little bit incompatible ... Any comment? Andr). 18-May-88 13:35:57-EDT,4516;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 18 May 88 13:35:43-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 18 May 88 13:33:02 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8234; Wed, 18 May 88 13:33:00 EDT Received: by BITNIC (Mailer X1.25) id 2709; Wed, 18 May 88 13:33:52 EDT Date: Wed, 18 May 88 10:46:13 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: Further on CP 037 DF = ISO 8859/5 FF = (SO) 7F. Also APL mode To: Frank da Cruz Many thanks to Andre (Andr)) Pirard for his comments on my last posting. He is right that the special treatment of EBCDIC DF (= ISO8859/1 FF) could reflect the special treatment of code point DF inside the Series/1 or 7171, instead of or in addition to complications for the seven-bit terminal data stream that now must accept a 7F (in the G1 set) as a graphic. To test this hypothesis, I altered the translation table on our test Series/1, from: EBCDIC Series/1 Terminal 8E (thorn) DE (G1) 7E (thorn) = ISO8859/1 FE DF (y-diaer.) DF (G1) 7F (y-diaer.) = ISO FF to: EBCDIC Series/1 Terminal 8E (thorn) DF (G1) 7E (thorn) = ISO8859/1 FE DF (y-diaer.) DE (G1) 7F (y-diaer.) = ISO FF In both cases, lowercase thorn displayed properly; y with diaeresis did not display. So we can infer that the internal code point DF is not a magic number that is used by other portions of the code, and that my trouble displaying y-diaeresis results from its position in the ISO8859/1 table. 
It seems likely that the Series/1 is actually sending out the desired shift-out + 7F to the terminal; I don't have a line monitor but putting the terminal into transparent mode allows me to see that it is receiving something, which it displays the same way as it displays a 7F. But the terminal, in normal operation, simply ignores ASCII DEL (7F) when it is received, and so I cannot make ISO8859/1 FF display. When equipment built to handle ISO8859 comes out, I assume this will not be a problem (unless one's data line refuses to transmit DEL); a PC running Yterm, similarly, may have no trouble with the 7F. It will only be devices which rely on the ISO 2022 definition of eight-bit sets which will have trouble. 2. APL mode. > You have to switch to APL mode to reach ISO2022 haven't you? I am not quite sure what is meant. On the IBM3163 I must hit an ALT-CHR key (a control-key combination) to get to the G1 set; not too hard. In the VM host I do *not* need to set the CP APL mode, or SET APL ON in Xedit. When I do issue a SET APL ON in Xedit, I hang the Series/1, for reasons I don't understand. Probably it has to do with the fact that we use SI/SO, but not the standard 3278 or 3277 APL conventions. The tables we are using do not emulate either the 3277 or the 3278 APL support -- that is, they do not insert 1D (IFS) or 08 (Graphic Escape) in front of the extended characters when handing characters to the host, nor do they expect 1D or 08 in front of the characters when taking a write from the host. That may be a dumb way to do it, and it certainly makes the protocol converter look different from a real 3278 or 3277. But it is simpler to see what is going on inside the protocol converter, and 3277s and 3278s don't support code page 037 anyhow. So I followed the lead of Tom Denier (sp?) at Penn, from whom we got the tables for support of the ALA character set. Seems to work okay, and I don't have to issue any special Xedit commands. 
(We did have to change the Xedit module to treat hex 41 through FE as displayable characters. When we made FF displayable, it caused terminal errors and dropped us into line-mode Xedit, so we backed out of that. This risks data loss if extended-EBCDIC files are edited on real 3270 devices, and I keep expecting data loss if they are edited on terminals which use the standard tables. But so far, we haven't had any data lost at all. Knock on wood!) Michael Sperberg-McQueen 19-May-88 10:01:05-EDT,2066;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 19 May 88 10:00:44-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 19 May 88 09:50:44 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9434; Thu, 19 May 88 09:50:43 EDT Received: by BITNIC (Mailer X1.25) id 7390; Thu, 19 May 88 09:51:39 EDT Date: Thu, 19 May 88 13:55:52 +0200 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre PIRARD Subject: Re: Further on CP 037 DF = ISO 8859/5 FF = (SO) 7F. Also APL mode To: Frank da Cruz In-Reply-To: Message of Wed, 18 May 88 10:46:13 CDT from >> You have to switch to APL mode to reach ISO2022 haven't you? > >I am not quite sure what is meant. On the IBM3163 I must hit an ALT-CHR >key (a control-key combination) to get to the G1 set; not too hard. In >the VM host I do *not* need to set the CP APL mode, or SET APL ON in >Xedit. When I do issue a SET APL ON in Xedit, I hang the Series/1, for >reasons I don't understand. Probably it has to do with the fact that we >use SI/SO, but not the standard 3278 or 3277 APL conventions. I mean activating APL mode on the 7171 or S/1 only. This, on the 7171, uses an alternative set of terminal translate tables. My hope is that it would make two translation modes possible. My fear is that it would trigger the host input APL escaping too. 
And I would neither like nor dare turning on APL mode on the host. The converters switch to and from APL mode with the setup functions invoked by "accent" "lower/upper letter A". Is that your ALT-CHR or does it mean the way you enter SO, which should be ctrl-N too isn't it? Andr). 19-May-88 12:45:19-EDT,1637;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 19 May 88 12:45:14-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 19 May 88 12:42:36 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0126; Thu, 19 May 88 12:42:35 EDT Received: by BITNIC (Mailer X1.25) id 9131; Thu, 19 May 88 12:44:06 EDT Date: Thu, 19 May 88 17:11:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: iso2022 To: Frank da Cruz Dear list subscribers Just a small comment for the time being. >Results: one problem so far. Because ISO, against its own rules (ISO >2022), put a graphic in position FF, which cannot be mapped into a >seven-bit data stream, that graphic (y-with-diaeresis, EBCDIC DF in the >US CECP) does not show up on my screen. ISO8859 is NOT an 8-bit code extension of a 7-bit code, but an 8-bit code in itself. Rules for 8-bit codes have been defined in ISO4873, not in ISO 2022. It is allowed to take here a 94 or a 96 character set for G1. If the 7171 cannot manage 8-bit codes, the worse for the 7171. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands :

19-May-88 18:25:01-EDT,6767;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Thu 19 May 88 18:24:58-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Thu, 19 May 88 18:22:21 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0575; Thu, 19 May 88 18:22:18 EDT
Received: by BITNIC (Mailer X1.25) id 4167; Thu, 19 May 88 18:22:12 EDT
Date: Thu, 19 May 88 12:49:00 CDT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Rick Troth
Subject: Headache
To: Frank da Cruz

I think it just hit me ...

We have at least three "schemes" (code pages) for EBCDIC here at Texas A&M. This does not count various ASCII character sets.

1) IBM 3192, 3179G (newer terminals): the "ISO set", has all (most of) the ISO8859/1 characters
2) IBM 3180, 3279G, 3179 (older terminals): the "3180 set"
3) IBM 7171, TN Print Chain, JNET (VAX), Kermit, WISCNET: the "7171 set"

I here include a "raw EBCDIC" file with illustrations of how it displays on these three main groups of terminals.

The big problem is not that 7171, WISCNET, et al, deviate from ISO; they were intended for mapping 7-bit ASCII to "some points" of EBCDIC (and thus are excused). But the real concern is that this 3180 set deviates everywhere. It is weird ... has duplications all over the place!

What do we call these sets? If the ISO set is CP037, then what is the 3180 set?

* note: I used ^ in place of 5 for the sake of 7171 users.

Raw EBCDIC

How does this display on your screen?

[16 x 16 grid of the 256 EBCDIC code points, 00 through FF, sent as raw bytes; they do not survive as text in this copy.]

IBM 3179G, 3192

   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0  -- -- -- -- -- -; -- -- -- -- -- -- -- -- -- --
1  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
2  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
3  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
4        ^a %a \a /a ~a @a $c ~n /c .  <  (  +  |
5  &  /e ^e %e \e /i ^i %i \i $s !  $  *  )  ;  ^
6  -  /  ^A %A \A /A ~A @A $C ~N -| ,  %  _  >  ?
7  $o /E ^E %E \E /I ^I %I \I `  :  #  @  '  =  "
8  $O a  b  c  d  e  f  g  h  i  << >> -d /y $P +-
9  ^0 j  k  l  m  n  o  p  q  r  -a -o $a /, $A @X
A  /u ~  s  t  u  v  w  x  y  z  !! ?? -D /Y $p @R
B  /\ -L -Y ^. -f /s |P 14 12 34 |( |) ^_ .. // v=
C  {  A  B  C  D  E  F  G  H  I  -  ^o %o \o /o ~o
D  }  J  K  L  M  N  O  P  Q  R  ^1 ^u %u \u /u ~u
E  \     S  T  U  V  W  X  Y  Z  ^2 ^O %O \O /O ~O
F  0  1  2  3  4  5  6  7  8  9  ^3 ^U %U \U /U ~U

IBM 3279G, 3179, 3180

   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0  -- -- -- -- -- -; -- -- -- -- -- -- -- -- -- --
1  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
2  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
3  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
4     |( |) -L -Y Pt $X $s /s ^- /c .  <  (  +  |
5  &  ^0 \/ /\ .. // /, \a \e \i !  $  *  )  ;  ^
6  -  /  \o \u ~a ~o %y \a \e /e -| ,  %  _  >  ?
7  \i \o \u %u $c %a %e %i %o `  :  #  @  '  =  "
8  %u a  b  c  d  e  f  g  h  i  ^a ^e ^i ^o ^u /a
9  /e j  k  l  m  n  o  p  q  r  /i /o /u ~n \A \E
A  \I ~  s  t  u  v  w  x  y  z  \O \U ~A ~O Y  A
B  E  E  I  O  U  Y  C  %A %E %I %O %U ^A ^E ^I ^O
C  {  A  B  C  D  E  F  G  H  I  ^U /A /E /I l  m
D  }  J  K  L  M  N  O  P  Q  R  /O /U ~N t  #
E  \  $a S  T  U  V  W  X  Y  Z  $o @a $c i  d  -;
F  0  1  2  3  4  5  6  7  8  9  $A $O @A $C -*

IBM 7171, TN Print Chain

   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0  -- -- -- -- -- ;  -- -- -- -- -- -- -- -- -- --
1  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
2  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
3  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
4     -- -- -- -- -- -- -- -- -- \  .  <  (  +  |
5  &  -- -- -- -- -- -- -- -- -- !  $  *  )  ;  /
6  -  /  -- -- -- -- -- -- -- -- |  ,  %  _  >  ?
7  -- -- -- -- -- -- -- -- -- `  :  #  @  '  =  "
8  -- a  b  c  d  e  f  g  h  i  -- -- -- -- -- --
9  -- j  k  l  m  n  o  p  q  r  -- -- -- -- -- --
A  -- ~  s  t  u  v  w  x  y  z  -- -- -- |( -- --
B  -- -- -- -- -- -- -- -- -- -- -- -- -- |) -- --
C  {  A  B  C  D  E  F  G  H  I  -- -- -- -- -- --
D  }  J  K  L  M  N  O  P  Q  R  -- -- -- -- -- --
E  \  -- S  T  U  V  W  X  Y  Z  -- -- -- -- -- --
F  0  1  2  3  4  5  6  7  8  9  -- -- -- -- -- --

20-May-88 08:04:17-EDT,1628;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 08:04:14-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 08:01:38 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0874; Fri, 20 May 88 08:01:37 EDT
Received: by BITNIC (Mailer X1.25) id 2067; Fri, 20 May 88 08:02:51 EDT
Date: Fri, 20 May 88 13:38:18 +0200
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Andre PIRARD
Subject: Re: iso2022
To: Frank da Cruz
In-Reply-To: Message of Thu, 19 May 88 17:11:00 MET from

>ISO8859 is NOT an 8-bit code extension of a 7-bit code, but an 8-bit
>code in itself.
Rules for 8-bit codes have been defined in ISO4873, not >in ISO 2022. It is allowed to take here a 94 or a 96 character set for >G1. If the 7171 cannot manage 8-bit codes, the worse for the 7171. Sorry Johan, but both the way ISO looks (80-9F free) and information from an IBM representative make it clear it was designed to be transmitted 7-bit wide. Else, it would have been a really lucky coincidence. I wonder which poor people this character affects. I agree with you 7-bit communication is nonsense. And parity is a hoax. But the 7171's are there next room. And they are dreadfully useful. Andr). 20-May-88 12:14:46-EDT,1571;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 12:14:40-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 12:11:57 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1153; Fri, 20 May 88 12:11:55 EDT Received: by BITNIC (Mailer X1.25) id 5193; Fri, 20 May 88 12:13:03 EDT Date: Fri, 20 May 88 11:57:31 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: iso2022 To: Frank da Cruz In-Reply-To: Your message of Fri, 20 May 88 13:38:18 +0200 ISO 8859-1 is an 8-bit code. I have not seen any standards on how it is to be transmitted over a communications wire. It might be 8 data plus parity. However, the issue you are discussing is implementation using existing equipment. When you try to implement ISO 8859-1 characters using the 7171 or Yale ASCII Comm System on an IBM Series/1, then because these controllers were programmed for 7 data bits, you must use the ISO 2022 or ANSI X3.41 protocols to use the characters in columns 10 to 15. This is a different problem than transmitting an 8-bit code. How does DEC (Digital Equipment Corporation) do it with the new VT300 series of terminals? 
Ed Hart 20-May-88 13:55:09-EDT,3061;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 13:55:07-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 13:52:31 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1254; Fri, 20 May 88 13:52:27 EDT Received: by BITNIC (Mailer X1.25) id 7552; Fri, 20 May 88 13:53:10 EDT Date: Fri, 20 May 88 11:53:56 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: 7-bit and 8-bit sets To: Frank da Cruz Many thanks to Johan van Wingen for his note. I apologize for my error in ascribing the rules on 8-bit code structures to ISO 2022. I once worked at a site where we had a fairly good collection of the ISO standards, but I don't have copies handy here, so my memory can slip. Has ISO 4873 been revised in the last ten years or so? And does 2022 really not talk at all about 8-bit sets? I was very sure that the ISO standards I read, when I studied them all a few years back, prohibited use of A0 and FF for graphics. The American standard which I think is the equivalent of ISO 2022 (and which I thought was *compatible* with all the relevant ISO standards, please correct me if I'm wrong) is ANSI X3.41 - 1974 ("American National Standard Code Extension Techniques for Use with the 7-Bit Coded Character Set of American National Standard Code for Information Interchange"). ANSI X3.41- 1974 (which I do have in front of me) does define the structure of 8-bit sets (despite its name) and provides for a 94-graphic G1 set: "The 8-bit code table consists of an ordered set of controls and graphic characters grouped as follows [...]: [...] (5) A set of ninety-four graphic characters allocated to columns 10-15, subject to the exception of positions 10/0 and 15/15" (Section 6.2). 
So even if ISO 8859/1 conforms with ISO standards, it doesn't conform with ANSI X3.41 unless ANSI has revised it very recently. I'm not sure I understand Ed Hart's note. He is right, of course, that we have to use ANSI X3.41 to transmit 8-bit codes over 7-bit wire. But how is that "a different problem than transmitting an 8-bit code"? Part of ANSI X3.41 *is* the definition of how to transmit 8-bit codes over 7-bit lines, stipulating that SI should be sent to switch from the G0 graphic set to the G1 graphic set, SO to switch back. (Switching between C0 and C1, the two sets of controls, seems to be handled by escape sequences not SI/SO.) Since A0 and FF are explicitly not part of the G1 set, ANSI X3.41 provides (sec. 9.3) that they should be represented, if they have to be, by a private escape sequence. Michael Sperberg-McQueen, University of Illinois at Chicago 20-May-88 14:50:08-EDT,3543;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 14:50:02-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 14:47:22 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1400; Fri, 20 May 88 14:47:19 EDT Received: by BITNIC (Mailer X1.25) id 8740; Fri, 20 May 88 14:48:11 EDT Date: Fri, 20 May 88 13:03:29 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Phil Howard KA9WGN Subject: Re: 7-bit and 8-bit sets To: Frank da Cruz In-Reply-To: Your message of Fri, 20 May 88 11:53:56 CDT > From: Michael Sperberg-McQueen > ..... I once > worked at a site where we had a fairly good collection of the ISO > standards, but I don't have copies handy here, so my memory can slip. This is an interesting issue. I have found it to be difficult to get hold of ANY ISO documentation at a reasonable price. 
ANSI documents are reasonably priced and ANSI sells them directly within the USA. There was a company I found once in Washington DC that sold ISO documents at exorbitantly high prices. I was told that two major factors went into this high price: extremely slick production of the documents themselves and very expensive binders to hold them. Further, the company charged a high markup. Finally, documents were bundled in such a way that nothing could be had for under $800 a few years ago.

Does anyone know a reasonable source for ISO documents without any slick covers or excessive bundling or any form of profiteering? (Such practices really should have no place in standardizing.)

Many documents about TCP/IP networking are readily available online. Are there any ISO documents online?

> of ANSI X3.41 *is* the definition of how to transmit 8-bit codes over
> 7-bit lines, stipulating that SI should be sent to switch from the G0
> graphic set to the G1 graphic set, SO to switch back. (Switching
> between C0 and C1, the two sets of controls, seems to be handled by
> escape sequences not SI/SO.) Since A0 and FF are explicitly not part of
> the G1 set, ANSI X3.41 provides (sec. 9.3) that they should be
> represented, if they have to be, by a private escape sequence.

If the C0 and C1 sets were switched by SI and SO, that would make SI present in C0 only, and SO present in C1 only, with the opposing codes acting as no-ops. I guess it would have worked, but maybe someone thought it wasteful to define 2 no-ops.

+-----------------------------------------------------------------------+
| Phil Howard, KA9WGN bitnet: |
| Research Programmer internet: |
| Computing Services Office or unix: |
| University of Illinois at U/C mci: 10222-1-217-BIG-MAIN |
| 1304 West Springfield Avenue at&t: 10288-1-217-BIG-MAIN |
| Urbana, IL 61801 sprint: 10333-1-217-BIG-MAIN |
| Phil's corollary: "If I was able to fix it, it must have been broke!"
| +-----------------------------------------------------------------------+ 20-May-88 10:23:53-EDT,9407;000000000015 Return-Path: <@CUVMA.COLUMBIA.EDU:A-PIRARD@BLIULG11.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 10:23:47-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 10:21:07 EDT Received: from VM1.ULG.AC.BE by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1007; Fri, 20 May 88 10:21:05 EDT Received: by BLIULG11 (Mailer X1.25) id 1202; Fri, 20 May 88 16:15:41 +0200 Date: Fri, 20 May 88 15:26:25 +0200 From: Andre PIRARD Subject: Extended ASCII with Kermit To: Frank da Cruz Frank, Here is a rewrite of the note I asked you to stop. My main goal is to reach Kermit developers. I leave it for you to judge if Info-Kermit is the right audience. Are there other lists like IBM-KERMIT for each implementation? If yes, that's of course the good place, but I'd like some feedback despite the fact I can't subscribe to all of them. I agree it's a big piece for Info-Kermit, and it could raise a lot of discussion. But you will understand it's vital for many people and it could add Kermit a big plus for them. In fact, some are probably secretly half way for terminal mode with SI/SO and G sets. I can shorten it to somewhat more than the conclusion for I-K if you like. Because of our special problems with and treat to the Mac, I am sending this note to Matthias Aebi K116430 @ CZHRZU1A and Paul Placeway PLACEWAY @ OHIO-STATE.ARPA. I hope these addresses are still all-right. Thanks in advance. Andr). ------------------------- For publication: Dear Kermit developers, Abstract. In the course of implementing our own national character sets with Kermit for terminal mode and file transfer, my understanding of the problem evolved from confusion to (near) simplicity and from national to international. 
I think my findings will be of much interest to those having to deal with the Spanish, French, German, Italian, well, the American continent, Western Europe and many other languages. That's, for them, really interconnecting the majority of computers existing to-day. On request, I've tried to be as short as possible at the risk of skimming here and there. I sure won't blame those getting bored with the subject. They can skip to the conclusion and see just what it takes in Kermit terms. Conversely, those really interested will get more information from the standards and the ISO8859 list of BITNET's LISTSERV @ JHUVM and its archives. Finally I take the occasion to praise all those devoting much of their time to straightening things that had run havoc. It's their ideas I am conveying. But I am sure glad to help. I just hope my limited English will carry the message precisely. Detail. In the process of implementing extended characters transfer between micros and IBM mainframes, I relied on the extended capabilities of Kermit 370 conversion (thanks John!). I came to the conclusion that, for the sole IBM PC, I should set up to 9 different tables in order to support 3 EBCDIC tables x 3 "ASCII" tables. For the Macintosh, that's 3 more tables with the IBM host. I was unable to have Kermit do Mac to IBM PC conversion, unless endeavouring translation on the PC, 3 more tables or so. I hacked some limited national characters support for IBMPC terminal mode through the 7171, but our Mac users were left with a dumb nice keyboard and a deaf screen. Kermit implements two main files transfer modes. Binary mode defines how to transport a continuous string of bytes containing values only required to be meaningful to the originating and receiving final systems. No matter how stored on an intermediate one, it should forward or return the same byte string on the communication line. The point here is that each node operation is clearly defined, making it the best method when appropriate. 
Text mode, in contrast, defines how to transport *records* containing codes for "readable" characters intended to be usable -- and stored as such -- on any system. The protocol specifies how to stream the records on the line in a system-independent manner. Again, every node should forward the data unaltered, that is, with an equivalent encoding on the communication line. The Kermit protocol wisely says that the ANSI X3.4 (ASCII) standard is to be used to represent these characters. It is the code used on most computers, and those not using it (IBM, Commodore) have to deal with their own problem of code conversion.

Most modern computers now implement an 8-bit extended character set in order to support, to various extents, languages requiring characters not found in ANSI X3.4 (I intentionally disregard the obsolete 7-bit remapping methods). Almost every one does it its own way, however, because there was no standard at the time they were devised (IBM even has several within a single system). Clearly, translation must occur somewhere to transmit extended text usefully between them. If it is done, say, by running a program in the receiving system, one must know and use the right table for the sender. The at least 7 codes that I have to deal with make for 42 tables in theory, one for each ordered pair of codes. In practice, what was a crystal clear matter as long as only X3.4 was used becomes a real puzzle with extended codes. The number of tables grows with the square of the number of codes, and so does the problem.

To a lesser extent, the same problem holds for terminal mode. It occurs only when a computer supports remote terminals, but there we must fiddle with a 7-bit data path, an issue that the Kermit protocol itself solves for file transfer.

It is evident that the problem lies in each machine having to deal with the others' own business, and that the solution is to have them talk a common code on the line, as is done now with X3.4, including by those that do not use it internally. Forcing them to use that code internally is impractical, though it would be advisable.
But having each convert the data to/from that common code before/after transmission reduces the above example to a mere 6 table pairs.

What is striking is the technical simplicity of translating every character data byte that flows on a communication line to a common code everyone agrees on. What is sad is that we have to. What is moot is which common code should be used on the line.

It is my strong feeling that having Kermit itself translate national codes, to make up for its host system not using a standard, will be *extremely* useful for people having to use these codes. This feature must be optional, because it is incompatible with previous use. It would be a shame to have two Kermit implementations for the same system corrupt data because one uses this feature and the other lacks it.

The cause of the problem, a missing standard, no longer exists. ISO 8859/1 = ANSI X3.134.2 = ECMA 94 has been defined and gathers every possible character extension for Latin group 1 users, by far the largest group, plus other common symbols satisfying many computer brands. It looks like a very well thought out thing, and several leading manufacturers have adopted it, or a pre-release because they couldn't wait, or have modified their previous codes to conform to ISO (they have exactly the same graphics, but use other code points, in line with this proposal). That's IBM, DEC, Microsoft and Lotus from what I gathered. It looks like the future that many, international and US, are working for.

The users of the ISO8859/x parts still on their way should not be left out. Their problem is parallel, but their codes will be untranslatable to ours. They might be expected to start with pure ISO right off. That is, until 16-bit (some say 32-bit) code sets are devised, but that's probably our children's Kermit.

Conclusion.

In summary, a Kermit implementation would be much enhanced for many people if simply:

- it optionally translated bytes during text mode file transfer (at the file I/O or equivalent level). Nothing elaborate is required to start this. Just a pair of null translation tables, easily found and patched, and a couple of lines of code will cover both the "translation" and "optional" topics.
- it did the same at the communication line I/O level during terminal mode and, when using a 7-bit wide data path, implemented the ISO 2022 SO/SI feature to use the upper half of the set (shift out) and revert to the lower one (shift in). Several already do.

That's all. But a welcome leap further would be to:

- if a particular system does not conform to ISO (like the Mac, it misses some of the ISO graphics or uses others), define a best-fit one-to-one correspondence between its character set(s) and ISO (there should be total agreement as to which, up to and including the manufacturer). It must involve all 256 codes in a reversible way.
- have systems supporting terminals do it in ISO mode, preferably on an 8-bit wide line.
- have these features bundled in options.

Thanks for your patience in reading.

Andr).

Oops, not yet. Andre'.
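
The two basic steps of the conclusion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any Kermit implementation. The names NULL_TABLE, make_inverse, translate, to_7bit and from_7bit are made up for the sketch. The single patched entry (the cent sign, at 9B in the IBM PC set and A2 in ISO 8859/1) stands in for a complete table. The 7-bit part handles only the graphics A0-FF and leaves aside both the 80-9F controls and the question of whether A0 and FF belong to G1.

```python
SO, SI = 0x0E, 0x0F  # ASCII shift-out / shift-in control codes

# A "null" table maps every byte to itself; a site patches entries as needed.
NULL_TABLE = bytes(range(256))


def make_inverse(table: bytes) -> bytes:
    """Invert a 256-entry table; fails unless the mapping is one to one."""
    if sorted(table) != list(range(256)):
        raise ValueError("table must map the 256 codes one to one")
    inverse = bytearray(256)
    for local_code, line_code in enumerate(table):
        inverse[line_code] = local_code
    return bytes(inverse)


def translate(data: bytes, table: bytes, enabled: bool = True) -> bytes:
    """Translate every byte through the table, or pass the data through."""
    return data.translate(table) if enabled else data


def to_7bit(data: bytes) -> bytes:
    """Send graphics A0-FF over a 7-bit path as SO, low 7 bits, SI."""
    out = bytearray()
    shifted = False
    for byte in data:
        if byte in (SO, SI) or 0x80 <= byte <= 0x9F:
            raise ValueError("sketch handles neither literal SO/SI nor 80-9F")
        high = byte >= 0xA0
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(byte & 0x7F)
    if shifted:
        out.append(SI)  # leave the line in the unshifted state
    return bytes(out)


def from_7bit(data: bytes) -> bytes:
    """Undo to_7bit; control codes below 20 are never affected by the shift."""
    out = bytearray()
    shifted = False
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        else:
            out.append(byte | 0x80 if shifted and byte >= 0x20 else byte)
    return bytes(out)


# Illustrative patch only: swap 9B and A2 so the table stays reversible.
pc_to_iso = bytearray(NULL_TABLE)
pc_to_iso[0x9B], pc_to_iso[0xA2] = 0xA2, 0x9B
pc_to_iso = bytes(pc_to_iso)
iso_to_pc = make_inverse(pc_to_iso)

line_bytes = to_7bit(translate(b"5\x9b", pc_to_iso))
print(line_bytes.hex(" "))  # 35 0e 22 0f
```

The receiving side would apply from_7bit and then its own table from the common code to its local one. Keeping every table a one-to-one mapping of all 256 codes is what lets a file make the round trip unchanged.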

20-May-88 17:09:24-EDT,2113;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 17:09:19-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 17:06:41 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1739; Fri, 20 May 88 17:06:39 EDT
Received: by BITNIC (Mailer X1.25) id 1594; Fri, 20 May 88 17:06:58 EDT
Date: Fri, 20 May 88 13:55:00 CDT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Rick Troth
Subject: Re: iso2022
To: Frank da Cruz
In-Reply-To: Your message of Fri 20 May 88 11:57:31 EDT

Ed, et al -

I do not have access to a VT300 series terminal, but on the VT200 series boxes:

1) 00-1F are the usual ASCII control codes
2) 20-7E are the usual ASCII graphics
3) 80-9F are new control codes, including CSI
4) A0-FE are new graphics, defaulting to DEC Supplemental (looks pretty much like A0-FE in ISO 8859/1) if you are in "Multi-National" mode.

On a 7-bit wire:

3) 80-9F are represented by ESC followed by one of 40-5F
4) A0-FE are displayed by SO, string of 20-7E, SI (thus APL support on 7171 can be fudged into ISO8859 support)

(Phil - SO/SI does not affect controls)

Examples:

3) The familiar (to VT100 users) cursor placement operation
   ESC [ row ; col H (7-bit)
   is equivalent to
   CSI row ; col H (8-bit)
   (that's "escape open-bracket ...")
4) The cent sign can be displayed by
   A2 (hex, 8-bit)
   or, with G1 as DEC Supplemental,
   SO 22 (hex) SI (7-bit)

- Rick

20-May-88 22:41:04-EDT,1374;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 20 May 88 22:41:01-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 20 May 88 22:38:17 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1935; Fri, 20 May 88 22:38:15 EDT
Received: by BITNIC (Mailer X1.25) id 4267; Fri, 20 May 88 22:39:42 EDT
Date: Fri, 20 May 88 20:05:20 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: John Kesich
Subject: Re: Obtaining ANSI and ISO Standards
To: Frank da Cruz
In-Reply-To: Message of Fri, 20 May 88 15:35:30 EDT from

Just so people have a feeling for what "reasonable" means in terms of ISO standard prices, here is a list of standards I picked up last December (at ANSI):

standard      price   # pages
----------    -----   -------
ISO 8859-1    $22      7
ISO 8859-2    $20      6
ISO 6937-1    $27     12
ISO 6937-2    $50     37

In short, plan on spending about 4X more than you would for a comparable ANSI standard.

24-May-88 07:45:14-EDT,3619;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 24 May 88 07:45:08-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 24 May 88 07:42:43 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4184; Tue, 24 May 88 07:42:41 EDT
Received: by BITNIC (Mailer X1.25) id 1096; Tue, 24 May 88 07:43:26 EDT
Date: Tue, 24 May 88 12:52:00 MET
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: Reply to 7171 change
To: Frank da Cruz

Dear list subscribers

So, Mr. Sperberg MacQueen wants to be cursed.
My connections with hellish powers are not all that, but I'll try. I am certainly one in the "community" who does not agree, but I can respect Mr. SMQ's experiments, not, however, a proposal that only solves US problems.

First, what is EBCDIC? If we take the yellow card (GX20-1850), we see two columns, one "standard", one for the T-11 and TN chains. Also, there are the GT10 type tables for the IBM 3800 printers, and the national variants (see 3270 manuals). The differences affect only a restricted number of graphics.

ID    NAME                      US  TN  Intern  GT10  CP500
SM06  left square bracket       --  AD  4A      AD    BA
SM08  right square bracket      --  BD  5A      BD    BB
SM11  left curly bracket        C0  8B  C0      C0    C0
SM14  right curly bracket       D0  8D  D0      D0    D0
SC04  cent sign                 4A  4A  --      4A    4A
SP02  exclamation mark          5A  5A  4F      5A    5A
SM13  vertical line             4F  4F  --      4F    4F
SM65  broken vertical line      6A  --  6A      6A    6A
SM07  reverse solidus (slash)   E0  --  E0      E0    E0
SD19  tilde                     A1  --  A1      A1    A1
SD13  grave accent              79  --  79      79    79

This is valid except for national variants at some of the 14 codes: 4A 5A 6A 79 5B 7B 7C 5F A1 C0 D0 E0 4F 7F ; following US are: Canadian Bilingual, English (UK), Hebrew, Japanese, Portuguese, Spanish; following International are: German, Belgian, Brazilian, Canadian French, Danish/Norwegian, Finnish/Swedish, French, Italian, Swiss.

The best test case is 4F:
US/CP037: exclamation mark
Int/CP500: vertical line

As for the extensions, CP037 and CP500 seem to be identical. The NOT sign is a separate problem to be discussed later on.

Second, what is ISO8859? Be warned, you do not solve anything if you include ISO8859-1 only in the discussion, (a note: it used to be ISO8859/1, but ISO changed very recently their rules for designating the Parts of a Standard, now it is ISO8859-1). There will be very soon a new set, called internally ISO-XYZ, being the harmonization of ISO6937 and 8859. SC2 will meet the week of 17 Oct. 1988 in London. Prepare your campaign, start to lobby now!

But all this leaves the central question unanswered.
Shall the code page be adapted to the translate table or the reverse? Mr. SMQ has shown that the translate table of the 7171 can be changed. Is that all? It is time to discuss the merits of the code pages. I'll keep my own opinion until my next contribution.

Yours faithfully, Johan van Wingen

FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands :

25-May-88 06:24:14-EDT,1288;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 06:24:09-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 06:21:58 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6244; Wed, 25 May 88 06:21:56 EDT
Received: by BITNIC (Mailer X1.25) id 0291; Wed, 25 May 88 06:22:52 EDT
Date: Wed, 25 May 88 02:28:00 CDT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Richard
Subject: Re:Reply to 7171 change
To: Frank da Cruz

Johan van Wingen says:

>you do not solve anything if you include ISO8859-1 only in the discussion,

I agree. The high order half of ISO8859-1 is of little use to anyone. Far better to use Adobe's "Standard Encoding" or even Xerox's "Character Set 0" as a basis for an 8 bit ASCII. Both of these codes store accents as separate characters instead of trying to store all possible combinations of accents and characters.
25-May-88 09:35:49-EDT,1560;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 09:35:47-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 09:33:38 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6396; Wed, 25 May 88 09:33:35 EDT Received: by BITNIC (Mailer X1.25) id 2455; Wed, 25 May 88 09:34:05 EDT Date: Wed, 25 May 88 08:56:48 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Usefulness of ISO 8859-1 To: Frank da Cruz In contrast to flying accent options, the ISO 8859-1 character set and code is very useful. It contains the necessary characters for over 40 countries. Compare that to ISO 646 and the National variations (one per language). ISO 8859-1 also has the extra characters needed to make the US 94 character EBCDIC and US ASCII X3.4 character sets match. ISO 8859-1 was developed because the computer manufacturers required a one-character per code point. However, when a printer implements the ISO 8859-1 code, nothing says that internally the printer could not use flying accents to form the characters. However, they need to be careful about the "i" character with accents like the umlaut. 
Ed Hart 25-May-88 09:36:58-EDT,2978;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 09:36:54-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 09:34:48 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6400; Wed, 25 May 88 09:34:47 EDT Received: by BITNIC (Mailer X1.25) id 2506; Wed, 25 May 88 09:35:17 EDT Date: Wed, 25 May 88 08:53:02 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Re:Reply to 7171 change X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz Now, come on. "Of little use to anyone" is clearly at least a bit of an exaggeration. And the store-the-character, store-the-position, store-the-accent strategy ignores two important problems: - For many purposes, these characters-with-extra-marks are CHARACTERS, not simply "some other character with an accent". - The programming language implications of trying to cope with characters and accents stored separately are pretty unpleasant. I'm not asserting that they cannot be made to work, but people keep assuming that * length(string) == number of characters in it * if length(string1) = length(string2), then they contain the same number of characters *and* occupy the same amount of storage (i.e., that either string1 or string2 can be copied into the storage occupied by the other). * that there is such a thing as character-width, and that characters can be extracted from strings and stored into a character-width object. * that a comparison for identity between character1 and character2 will be true iff they are the same character (and not that one of them is followed by an accent that changes its meaning). And * things can be sorted into collating order using simplistic bit-compare algorithms. 
I stipulate that some of those principles overlap, and that a smaller number of rules is possible. I also stipulate that one can design runtime to eliminate or hide all of the problems (given careful runtime and user programming), but suggest that such runtime would get little assistance from current hardware and, consequently, would tend to deliver unacceptable performance. character-overstrike_indicator-accent approaches are fine for page definition languages (I note that your two examples were both of that class), and are OK for a data communications stream that will be printed (or displayed), but not further processed, but really fairly poor for either information interchange or processing and text manipulation. John Klensin, MIT 25-May-88 10:17:04-EDT,1907;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 10:17:02-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 10:14:55 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6458; Wed, 25 May 88 10:14:53 EDT Received: by BITNIC (Mailer X1.25) id 3323; Wed, 25 May 88 10:10:41 EDT Date: Wed, 25 May 88 14:58:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: 7171 and Mr.Troth To: Frank da Cruz Dear list subscribers I have been inspecting Mr. Troth's tables. The first one (Raw EBCDIC) appears to have arrived quite correctly. If you turn HEX ON (under PDF) you see that all 256 codes are there. I tried it on a 3278, a 3192-G, a VT100 (by class=C71, that is by the 7171) and on a PC by KERMIT, also by C71, and there is no difference. Only if you look at the actual characters on the screen you see other representations. This implies that 8-bit codes are being transferred by BITNET correctly. 
Only as soon as you start interpreting those codes no longer as EBCDIC problems arise. But that is a matter of local changes to character interpretations of codes. If you turn on APL at your terminal, you get other things to see. Why not invent an ISO8859-1 button? This done, I do not understand what the fuss with the 7171 is about. Who is fooling whom? Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands : 25-May-88 15:53:48-EDT,2527;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 15:53:43-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 15:53:33 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7079; Wed, 25 May 88 15:53:25 EDT Received: by BITNIC (Mailer X1.25) id 1608; Wed, 25 May 88 15:54:04 EDT Date: Wed, 25 May 88 11:46:00 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Rick Troth Subject: Re: Usefulness of ISO 8859-1 To: Frank da Cruz In-Reply-To: Your message of Wed 25 May 88 08:56:48 EDT oAmen to that, Ed! (Spanish syntax) The one-to-one ASCII to EBCDIC table(s) is *the strength* of the ISO8859-1 set. As I indicated in a recent long posting, we suffer from the confusion of three different EBCDIC's here at A&M. The problem is most clearly illustrated in the display of square brackets. Suppose some random user sits down at some random terminal. He logs in and reads his Inter-Net mail with embedded brackets. Whether he sees brackets or "something else" depends on what terminal he is using and what code points the brackets were translated to by whatever gateway passed the mail to BITNET. 
Y       display as left and right brackets on a 3192
& a     display as left and right brackets on a 3180
[ ]     display as left and right brackets on a 7171

It so happens that the character set in the 3192 displays ALL of the
characters in the ISO8859-1 set. The 3180 DOES NOT. The 7171 DOES NOT.
If you map 7-bit ASCII to some of EBCDIC, then you may be able to put
up with this. But ASCII machines are starting to use all eight bits.
Furthermore connectivity is the word of the day.

Personally, I would hope that ISO8859-2 can be mapped to the
corresponding national EBCDIC, and likewise for ISO8859-3, etc. I did
not get the impression that Michael S-McQ nor anyone else on this list
wants to "leave Europe out in the cold". But let's take one step at a
time, please.

How does "raw EBCDIC" display on your tube?

- Rick

25-May-88 12:08:08-EDT,2494;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 12:08:03-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 12:02:00 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6766; Wed, 25 May 88 12:01:58 EDT
Received: by BITNIC (Mailer X1.25) id 5126; Wed, 25 May 88 12:00:08 EDT
Date: Wed, 25 May 88 10:06:12 CDT
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Phil Howard KA9WGN
Subject: Re: Usefulness of ISO 8859-1
To: Frank da Cruz
In-Reply-To: Your message of Wed, 25 May 88 08:56:48 EDT

> However, when a printer implements the ISO 8859-1 code, nothing says that
> internally the printer could not use flying accents to form the characters.
> However, they need to be careful about the "i" character with accents like
> the umlaut.

Does this mean that "flying accents" are only formed by overstriking?
Does a backspace control character separate the base character from its
accent mark?
((( I would think this not necessary when designing a new code with ))) ((( the sophistication of today's computers. The accent code could ))) ((( be made to preceed the base character, and the accent code would ))) ((( imply a modification to the next coming base character. ))) > character sets match. ISO 8859-1 was developed because the computer > manufacturers required a one-character per code point. Not knowing the actual codes ISO puts out, it is hard to make specific comments since they may be really part of a different code. I once looked at a number of ways to do this myself. I looked at many languages and collected a list of different accents. Then, by combining them with the Roman alphabet, I came up with over 3000 possibilities. Double that again for Cyrillic. And that is just most of Europe. Still, the number of actually used accented letters in the various languages would put a stress on codifying them all in just 256 possible codes. Just how many different languages are being codified here? Does anyone have a list of them? Are these standards going to lock out certain languages? 25-May-88 12:36:07-EDT,4047;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Wed 25 May 88 12:36:04-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Wed, 25 May 88 12:25:10 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6800; Wed, 25 May 88 12:25:08 EDT Received: by BITNIC (Mailer X1.25) id 5801; Wed, 25 May 88 12:18:58 EDT Date: Wed, 25 May 88 10:18:42 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Phil Howard KA9WGN Subject: RE: Re:Reply to 7171 change To: Frank da Cruz In-Reply-To: Your message of Wed, 25 May 88 08:53:02 EST > - For many purposes, these characters-with-extra-marks are CHARACTERS, > not simply "some other character with an accent". 
> - The programming language implications of trying to cope with > characters and accents stored separately are pretty unpleasant. I'm not > asserting that they cannot be made to work, but people keep assuming > that > * length(string) == number of characters in it > * if length(string1) = length(string2), then they contain the same > number of characters *and* occupy the same amount of storage (i.e., that > either string1 or string2 can be copied into the storage occupied by the > other). > * that there is such a thing as character-width, and that characters > can be extracted from strings and stored into a character-width object. > * that a comparison for identity between character1 and character2 > will be true iff they are the same character (and not that one of them > is followed by an accent that changes its meaning). And > * things can be sorted into collating order using simplistic > bit-compare algorithms. Is it absolutely necessary that the representation of character codes INTERNAL to a machine, and EXTERNALLY (inter-machine communication) be identical? Clearly if not, a processing logic must be applied as a gateway in and out of a machine to transpose the code sets. This overhead is typical, however, given that many communications protocols even now include various forms of Huffman or Lempel-Ziv compression protocols. So, overhead is a weak argument. The last "I" in ASCII means "Interchange". Does the implications also apply in practice for ISO codes? > I stipulate that some of those principles overlap, and that a smaller > number of rules is possible. I also stipulate that one can design > runtime to eliminate or hide all of the problems (given careful runtime > and user programming), but suggest that such runtime would get little > assistance from current hardware and, consequently, would tend to > deliver unacceptable performance. 
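The assumptions listed in the quoted text can be seen failing in a few
lines. This is only a sketch: it uses Python and Unicode combining marks
as a stand-in for an accent stored separately from its base letter.
Neither is one of the codes discussed on this list, and the variable
names are illustrative only.

```python
import unicodedata

# One code per character, as in ISO 8859-1: e-acute is a single unit.
composed = "caf\u00e9"

# Accent stored separately: NFD splits e-acute into 'e' plus a combining acute.
decomposed = unicodedata.normalize("NFD", composed)

# length(string) no longer equals the number of characters a reader sees.
print(len(composed), len(decomposed))

# A naive identity comparison fails although both spell the same word.
print(composed == decomposed)

# Extracting "the fourth character" yields a bare 'e' in the decomposed form.
print(repr(composed[3]), repr(decomposed[3]))
```

The same word occupies four units in one form and five in the other, so
neither string can be copied blindly into storage sized for the other.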
How about a wider character code for INTERNAL machine processing where the convenience of fixed interval addressing is very important, and a RELATED EXTERNAL code for "Interchanging" these codes knowing that typical uses will involve small subsets of the overall code, making it possible to apply an "obvious" compression of selecting code subsets. Some data compression techniques can actually do this for you and make a 16-bit code set where less than 256 codes are typically used transmit just about as efficiently as if the codes had been defined in an 8-bit set. > character-overstrike_indicator-accent approaches are fine for page What's wrong with (accent_implying_zero_forward_space)-(character) coding? > definition languages (I note that your two examples were both of that > class), and are OK for a data communications stream that will be > printed (or displayed), but not further processed, but really fairly > poor for either information interchange or processing and text > manipulation. > John Klensin, MIT 27-May-88 20:42:03-EDT,1210;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 27 May 88 20:41:59-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 27 May 88 20:42:32 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0543; Fri, 27 May 88 20:42:31 EDT Received: by BITNIC (Mailer X1.25) id 0892; Fri, 27 May 88 20:42:29 EDT Date: Fri, 27 May 88 20:22:49 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: ECMA registered codes via DRCS? To: Frank da Cruz ISO2022 defines Dynamically Redefinable Character Sets (DRCS) - which essentially allows a user to define and load their own character set. VT200's and emulators (which means most terminals of recent vintage) support DRCS. Are there any DRCS's available for ECMA codes? 
If not, does anyone know of any software tools for creating DRCS's? 27-May-88 20:22:15-EDT,2696;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Fri 27 May 88 20:22:09-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Fri, 27 May 88 20:22:44 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0539; Fri, 27 May 88 20:22:43 EDT Received: by BITNIC (Mailer X1.25) id 0808; Fri, 27 May 88 20:22:29 EDT Date: Fri, 27 May 88 18:35:47 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John Kesich Subject: Re: Extended ASCII with Kermit To: Frank da Cruz In-Reply-To: Message of Fri, 27 May 88 14:44:05 +0200 from From my reading of ISO's 646, 2022, 4873, 8859-1 & 8859-2 I have come to the conclusion that there is a fairly widespread misunderstanding of ISO8859. If I'm the one who has misunderstood I hope someone will take the trouble to correct me. People seem to think that you pick one of the ISO8859-x sets and then those 256 characters are the only ones used. However, ISO's 2022 & 4873 define a number of escape sequences for switching among different versions (as they term character sets which conform to the standards). What this means is that simple translation table mappings are not enough to translate ISO to other code sets, one must also change translation tables 'on the fly' as the escape sequences are encountered. 
A somewhat simplified example may help to illustrate the problem:

data stream
(ISO notation)     hex        comments
--------------     ---        --------
ESC 02/00 04/12    1B 20 4C   select level 1 of ISO4873
ESC 02/13 04/01    1B 2D 41   designate (and invoke) ISO8859-1's G1 set
12/00              C0         1st 'real' character - capital A, grave accent
ESC 02/13 04/02    1B 2D 42   designate (and invoke) ISO8859-2's G1 set
12/00              C0         2nd 'real' character - capital R, acute accent

Does an implementation which uses a single set of ISO8859-x characters
conform to the standard? Even if it does, would it make any sense to
standardize on a particular ISO8859-x to the exclusion of others?
Finally, if one were to do so, how would the 2 character text in my
example be transmitted?

Any implementation which doesn't include the ISO escape sequences will
eventually have to incorporate some such mechanism. I think the ISO
escape sequences should be a part of any standard which is adopted.

28-May-88 06:13:40-EDT,1772;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:A-PIRARD@BLIULG11.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Sat 28 May 88 06:13:30-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Sat, 28 May 88 06:13:55 EDT
Received: from VM1.ULG.AC.BE by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0716; Sat, 28 May 88 06:13:53 EDT
Received: by BLIULG11 (Mailer X1.25) id 1548; Sat, 28 May 88 12:12:35 +0200
Date: Sat, 28 May 88 12:11:26 +0200
From: Andre PIRARD
Subject: Precision to my ISO8859/1 document
To: ISO8859@JHUVM, Frank da Cruz, IBM-KERMIT@CU20B.COLUMBIA.EDU, Paul Placeway, Matthias Aebi

In the document describing the ISO8859/1 and related character sets, I
forgot to make the following remark to be added to the file. Sorry.

Andre'.

- The character range 80-9F is undefined in the description of ISO8859/1
  I have. I don't know its real status, but this feature is welcome for
  two reasons.
First, it avoids control characters during transmission on a 7-bit line (ISO2022: an SO code shifts to the upper half of the set, an SI code reverts to the lower one). As an added bonus, this keeps Kermit overhead (8-th bit quoting) to a minimum. Second, it allows rearranging a previous 8-bit code set that used this range for national characters. These are moved to the ISO positions and the expelled non-ISO characters can be moved to the 80-9F range. What appears in my listing is the assignment made by IBM for its graphic characters mainly. 30-May-88 14:19:26-EDT,2798;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Mon 30 May 88 14:19:21-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Mon, 30 May 88 05:53:51 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1929; Mon, 30 May 88 05:53:50 EDT Received: by BITNIC (Mailer X1.25) id 0253; Mon, 30 May 88 05:53:10 EDT Date: Mon, 30 May 88 11:06:44 +0200 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre PIRARD Subject: Re: Extended ASCII with Kermit To: Frank da Cruz In-Reply-To: Message of Fri, 27 May 88 18:35:47 EST from >People seem to think that you pick one of the ISO8859-x sets and then >those 256 characters are the only ones used. However, ISO's 2022 & 4873 >define a number of escape sequences for switching among different >versions (as they term character sets which conform to the standards). >What this means is that simple translation table mappings are not enough >to translate ISO to other code sets, one must also change translation >tables 'on the fly' as the escape sequences are encountered. 
>A somewhat simplified example may help to illustrate the problem:
>
>data stream
>(ISO notation)     hex        comments
>--------------     ---        --------
>ESC 02/00 04/12    1B 20 4C   select level 1 of ISO4873
>ESC 02/13 04/01    1B 2D 41   designate (and invoke) ISO8859-1's G1 set
>12/00              C0         1st 'real' character - capital A, grave accent
>ESC 02/13 04/02    1B 2D 42   designate (and invoke) ISO8859-2's G1 set
>12/00              C0         2nd 'real' character - capital R, acute accent

That's the way to build a super terminal to display data from a super
text processor that can manage all languages simultaneously.
But how will this processor store its text? Not in a plain 8-bit text
file obviously.
And that's what I was talking of: transferring to-day's 8-bit files
that store one version of ISO8859 and terminal support for that
one version of code. Let's first agree on how to do that.
File transfer of more elaborate data will have to encode the data for
integrity anyway. So, the ISO scheme can apply only to terminal mode.
But thanks for the information John.

By the way, could you describe in a couple of lines how ISO defines
switching between the two halves of a single 8-bit set with SI/SO
for a 7-bit line? The mechanism looks fairly obvious, but I would hate
missing some subtle feature.

Andre'.
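
The "on the fly" change of translation table in the quoted example can
be sketched in a few lines. This is a minimal illustration, not a
conforming ISO 2022 decoder: it understands only the ESC 02/13 F
designation of a G1 set and skips the ESC 02/00 F announcer. The name
decode_stream, the table G1_CODECS and the default set are assumptions
made for the example, and Python's iso8859-1 and iso8859-2 codecs are
taken as stand-ins for the two translation tables.

```python
ESC = 0x1B

# Final bytes from the quoted example: 04/01 selects the ISO8859-1 G1 set,
# 04/02 the ISO8859-2 G1 set (hypothetical lookup table for this sketch).
G1_CODECS = {0x41: "iso8859-1", 0x42: "iso8859-2"}


def decode_stream(data: bytes, default_g1: str = "iso8859-1") -> str:
    """Decode an 8-bit stream, switching the G1 table at each designation."""
    out = []
    g1 = default_g1  # assumed starting table, not mandated by the example
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            seq = data[i + 1:i + 3]
            if len(seq) < 2:
                raise ValueError("truncated escape sequence")
            inter, final = seq
            if inter == 0x2D:              # ESC 02/13 F: designate G1
                if final not in G1_CODECS:
                    raise ValueError("unknown G1 set %#04x" % final)
                g1 = G1_CODECS[final]
            elif inter != 0x20:            # ESC 02/00 F (announcer) is skipped
                raise ValueError("unsupported escape sequence")
            i += 3
            continue
        if b >= 0xA0:
            out.append(bytes([b]).decode(g1))   # right half: current G1 table
        else:
            out.append(chr(b))                  # left half and controls as-is
        i += 1
    return "".join(out)


# The two-character text of the example: the same byte C0 decodes twice,
# once under each table.
example = bytes.fromhex("1B204C" "1B2D41" "C0" "1B2D42" "C0")
print([hex(ord(c)) for c in decode_stream(example)])
```

A single fixed translation table would turn both C0 bytes into the same
character, which is the point of the example.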
31-May-88 08:33:12-EDT,1095;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 31 May 88 08:33:10-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 31 May 88 08:33:58 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2888; Tue, 31 May 88 08:33:56 EDT Received: by BITNIC (Mailer X1.25) id 0442; Tue, 31 May 88 08:33:32 EDT Date: Tue, 31 May 88 08:24:11 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Precision to my ISO8859/1 document To: Frank da Cruz In-Reply-To: Your message of Sat, 28 May 88 12:11:26 +0200 ISO 8859-1 columns 8 and 9 (X'80' to X'9F') are reserved for the C1 control character set. They may not be used for (printable) characters, only for control characters. 31-May-88 09:15:25-EDT,3911;000000000001 Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 31 May 88 09:15:17-EDT Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 31 May 88 09:16:08 EDT Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2930; Tue, 31 May 88 09:16:07 EDT Received: by BITNIC (Mailer X1.25) id 2115; Tue, 31 May 88 09:14:59 EDT Date: Tue, 31 May 88 15:00:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: What is EBCDIC? To: Frank da Cruz Dear list subscribers It is very difficult indeed to be clear and precise. Thus I have to present a corrected version of the EBCDIC part of my "Reply to 7171 change". As CP037 and CP500 are not available here, please send me any correction to these tables, in order that we know what we are speaking about when we are discussing variants of EBCDIC. 
Yours faithfully,
Johan van Wingen

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
First, what is EBCDIC? We consider for this moment the basic set with
94 characters only. If we take the yellow card (GX20-1850), we see two
columns, one "standard", one for the T-11 and TN chains. Also, there
are the GT10 type tables for the IBM 3800 printers. Further, there are
national variants, based on "US" and "International", (see IBM3270
Information Display System, Character Set Reference, GA27-2837-9,
Figure 10-43). Finally, there are CP037 and CP500, containing
extensions. (The 4250 code pages, which are still more different, are
left out.) The differences affect only a restricted number of graphics.
(the CP037 and CP500 are only a guess, please send corrections)

                                     Interna                           ISO
ID    NAME                      US   tional   TN   GT10  CP037  CP500  8859-1
SM06  left square bracket       --   4A       AD   AD    BA     4A     5B
SM08  right square bracket      --   5A       BD   BD    BB     5A     5D
SM11  left curly bracket        C0   C0       8B   8B    C0     C0     7B
SM14  right curly bracket       D0   D0       9B   9B    D0     D0     7D
SC04  cent sign                 4A   --       4A   4A    4A     4A     A2
SP02  exclamation mark          5A   4F       5A   5A    5A     4F     21
SM13  vertical line             4F   --       4F   4F    4F     5A     7C
SM65  broken vertical line      6A   6A       --   --    6A     6A     A6
SM07  reverse solidus (slash)   E0   E0       --   E0    E0     E0     5C
SD19  tilde                     A1   A1       --   --    A1     A1     7E
SD13  grave accent              79   79       --   --    79     79     60
SM66  not sign                  5F   5F       5F   5F    5F     5F     AC
SD15  circumflex accent         --   --       --   --    B0     B0     5E

This is valid except for national variants at some of the 14 codes:
4A 5A 6A 79 5B 7B 7C 5F A1 C0 D0 E0 4F 7F ;
following US are: Canadian Bilingual, English (UK), Hebrew, Japanese,
Portuguese, Spanish;
following International are: German, Belgian, Brazilian, Canadian
French, Danish/Norwegian, Finnish/Swedish, French, Italian, Swiss.

The best test case for determining your set is 4F:
   US/CP037:             vertical line
   International/CP500:  exclamation mark

As for the extensions, CP037 and CP500 seem to be identical, (TN and
GT10 have different extensions).
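
The 4F test can be tried mechanically. In the table, 4F is the vertical
line in the US and CP037 columns and the exclamation mark in the
International column. The sketch below assumes that the "cp037" and
"cp500" codecs shipped with Python correspond to the CP037 and CP500
discussed here; glyph_at_4f and guess_family are hypothetical helper
names introduced only for this illustration.

```python
# Byte X'4F' is the probe suggested in the text.
PROBE = bytes([0x4F])


def glyph_at_4f(codec: str) -> str:
    """Return the character a given code page assigns to X'4F'."""
    return PROBE.decode(codec)


def guess_family(glyph: str) -> str:
    """Name the EBCDIC family from the glyph seen at X'4F' (hypothetical)."""
    if glyph == "|":
        return "US/CP037"
    if glyph == "!":
        return "International/CP500"
    return "unknown (possibly a national variant)"


for codec in ("cp037", "cp500"):
    glyph = glyph_at_4f(codec)
    print(codec, repr(glyph), guess_family(glyph))
```

On a real terminal the same check is done by eye: display a byte X'4F'
and see which of the two graphics appears.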
The NOT sign is a separate problem to be discussed later on.
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

FROM J. W. van Wingen MOSGLA@HLERUL2 :
Mail to : P. O. Box 486, 2300AL Leiden, Netherlands :

31-May-88 21:48:06-EDT,4711;000000000001
Return-Path: <@CUVMA.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.COLUMBIA.EDU by CU20B.COLUMBIA.EDU with TCP; Tue 31 May 88 21:48:00-EDT
Received: from CUVMA.COLUMBIA.EDU(MAILER) by CUVMA.COLUMBIA.EDU(SMTP) ; Tue, 31 May 88 21:48:36 EDT
Received: from BITNIC.BITNET by CUVMA.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4073; Tue, 31 May 88 21:48:35 EDT
Received: by BITNIC (Mailer X1.25) id 2182; Tue, 31 May 88 21:43:33 EDT
Date: Tue, 31 May 88 18:36:21 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: John Kesich
Subject: Re: Extended ASCII with Kermit
To: Frank da Cruz
In-Reply-To: Message of Mon, 30 May 88 11:06:44 +0200 from

> That's the way to build a super terminal to display data from a super
> text processor that can manage all languages simultaneously.
> But how will this processor store its text? Not in a plain 8-bit text
> file obviously.
> And that's what I was talking of: transferring to-day's 8-bit files
> that store one version of ISO8859 and terminal support for that
> one version of code. Let's first agree on how to do that.
> File transfer of more elaborate data will have to encode the data for
> integrity anyway. So, the ISO scheme can apply only to terminal mode.

By today's 8-bit codes I can only assume that you are referring to ECMA
registered codes (such as the ISO8859 character sets). Each of these
codes has 2 registered designation sequences (G0 and G1 character sets
are designated separately). What I described in my previous note was
not a proposed ISO standard but something that has been around since
1973. The only new element in the picture is ISO8859.
As I understand it, ISO8859 represents the first set of internationally
agreed upon VERSIONS of ISO character sets.

** perhaps someone could post a list of the other ECMA registered
   character sets **

There is currently limited hardware support for these escape sequences;
I can only guess that they are more heavily used in Europe than in
America. However, even here the DRCS escape sequence defined in ISO2022
is widely supported (as I mentioned in a previous note). There is at
least one word processing package that I know of which makes use of it
to provide alternate characters (WordMARC which provides Greek
characters and math symbols). However, such programs as Tex, Troff,
Script, MacWrite, etc. should be able to do the same. (I can't guess at
how much effort would be required, but reinventing the ISO escape
sequences - and I am sure they would be reinvented - can't be easier.)

As far as I am concerned it makes no sense to adopt ISO8859 without the
related escape sequences.

> By the way, could you describe in a couple of lines how ISO defines
> switching between the two halves of a single 8-bit set with SI/SO
> for a 7-bit line? The mechanism looks fairly obvious, but I would hate
> missing some subtle feature.

I don't claim to be an expert, and I hope others will correct any
mistakes, but here is my understanding of how it works:

There are 3 sets of escape sequences:

           designator
      94 char      96 char      invoker   single-character-invoker
G0    ESC 2/8 f                 SI
G1    ESC 2/9 f    ESC 2/13 f   SO
G2    ESC 2/10 f   ESC 2/14 f   LS2       SS2
G3    ESC 2/11 f   ESC 2/15 f   LS3       SS3

where 'f' is the code assigned by ECMA in accordance with ISO2375.
and the shift sequences are defined as follows:

SO    0/14       (called LS1 in 8-bit environments - I don't know the difference)
SI    0/15       (called LS0 in 8-bit environments - I don't know the difference)
LS2   ESC 6/14
LS3   ESC 6/15
SS2   ESC 4/14   (8/14 in 8-bit environments)
SS3   ESC 4/15   (8/15 in 8-bit environments)

So you designate your 4 graphic character sets and then use the various
shift (invoker) sequences as needed. For example:

ESC 2/8 4/2     designate ISO8859-1 G0 as your G0
ESC 2/13 4/1    designate ISO8859-1 G1 as your G1
ESC 2/14 4/2    designate ISO8859-2 G1 as your G2
SO 4/0          'load' G1 (ISO8859-1 G1)
4/0 6/0         'print' A-grave a-grave
SS2 4/0         'load' position 4/0 from G2 (ISO8859-2 G1)
4/0 6/0         'print' R-acute a-grave
SI              'load' G0 (ISO8859-1 G0)
4/0 6/0         'print' @ ` (commercial at, grave accent)

I hope this will be helpful.

6-Jun-88 10:20:23-EDT,3898;000000000001
Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET>
Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Mon 6 Jun 88 10:20:19-EDT
Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Mon, 06 Jun 88 10:21:15 EDT
Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0666; Mon, 06 Jun 88 10:21:08 EDT
Received: by BITNIC (Mailer X1.25) id 3087; Mon, 06 Jun 88 09:37:30 EDT
Date: Mon, 6 Jun 88 15:19:00 MET
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: What is EBCDIC? (2nd correction)
To: Frank da Cruz

Dear list subscribers

With the help of Mr. Pirard's table I could correct the CP500 column.
Here is the second corrected version.

Yours faithfully,
Johan van Wingen

########################################################################
What is EBCDIC? (2nd Correction)
########################################################################
First, what is EBCDIC? We consider for this moment the basic set with
94 characters only.
If we take the yellow card (GX20-1850), we see two columns, one
"standard", one for the T-11 and TN chains. Also, there are the GT10
type tables for the IBM 3800 printers. Further, there are national
variants, based on "US" and "International", (see IBM3270 Information
Display System, Character Set Reference, GA27-2837-9, Figure 10-43).
Finally, there are CP037 and CP500, containing extensions. (The 4250
code pages, which are still more different, are left out.) The
differences affect only a restricted number of graphics. A compromise
between CP037 and CP500 should be possible.

                                ISO          Interna                           My
ID    NAME                      8859-1  US   tional   TN   GT10  CP037  CP500  prop.
SM06  left square bracket       5B      --   4A       AD   AD    BA     4A     4A
SM08  right square bracket      5D      --   5A       BD   BD    BB     5A     5A
SM11  left curly bracket        7B      C0   C0       8B   8B    C0     C0     C0
SM14  right curly bracket       7D      D0   D0       9B   9B    D0     D0     D0
SC04  cent sign                 A2      4A   --       4A   4A    4A     B0     BA
SP02  exclamation mark          21      5A   4F       5A   5A    5A     4F     6A
SM13  vertical line             7C      4F   --       4F   4F    4F     BB     4F
SM65  broken vertical line      A6      6A   6A       --   --    6A     6A     BB
SM07  reverse solidus (slash)   5C      E0   E0       --   E0    E0     E0     E0
SD19  tilde                     7E      A1   A1       --   --    A1     A1     A1
SD13  grave accent              60      79   79       --   --    79     79     79
SM66  not sign                  AC      5F   5F       5F   5F    5F     BA     5F
SD15  circumflex accent         5E      --   --       --   --    B0     5F     B0

This is valid except for national variants at some of the 14 codes:
4A 5A 6A 79 5B 7B 7C 5F A1 C0 D0 E0 4F 7F ;
following US are: Canadian Bilingual, English (UK), Hebrew, Japanese,
Portuguese, Spanish;
following International are: German, Belgian, Brazilian, Canadian
French, Danish/Norwegian, Finnish/Swedish, French, Italian, Swiss.

The best test case for determining your set is 4F:
   US/CP037:             vertical line
   International/CP500:  exclamation mark

As for the extensions, CP037 and CP500 are identical, (TN and GT10 have
different extensions). The NOT sign is a separate problem to be
discussed later on.
########################################################################

FROM J. W. van Wingen MOSGLA@HLERUL2 :
Mail to : P. O.
Box 486, 2300AL Leiden, Netherlands : 8-Jun-88 06:35:48-EDT,2223;000000000001 Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Wed 8 Jun 88 06:35:46-EDT Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Wed, 08 Jun 88 06:36:44 EDT Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3019; Wed, 08 Jun 88 06:36:43 EDT Received: by BITNIC (Mailer X1.25) id 4920; Wed, 08 Jun 88 06:36:38 EDT Date: Wed, 8 Jun 88 12:24:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: EBCDIC on screen To: Frank da Cruz Dear list subscribers What shows up on your screen when looking at all 256 bytes ("raw EBCDIC") may depend on several things. 1. the hardware of your terminal 2. how the control unit (3174, 3274) has been customized 3. the presence of PS ("programmable storage") 4. the operating system (OS/MVS/TSO or VM/CMS) 5. the option chosen under your editor (with TSO/ISPF/PDF you may use PDF 0.1 setting one of 3278, 3278A, 3278T, 3278CN, 3278KN, each giving a different screen content) It would be helpful to know how some effects on several terminal types may be achieved. I have no idea how to show CP037 on a 3192G. Here it is a MVS-only site. It seems that most of the contributions came from VM/CMS sites, producing very little that I could use directly. With all editing done under ISPF/PDF, it would the best solution to have the GDDM symbol sets for CP037 and CP500 (due to Mr. J. Wilhelm) accessible to ISPF. As these can be easily supplemented by other sets, all code page problems can be solved. As for printers either IEBIMAGE or APA software can realize everything desirable. Can anyone report having experience in doing this? Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands : 8-Jun-88 09:38:53-EDT,11965;000000000001 Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Wed 8 Jun 88 09:38:42-EDT Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Wed, 08 Jun 88 09:39:38 EDT Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3223; Wed, 08 Jun 88 09:39:35 EDT Received: by BITNIC (Mailer X1.25) id 7377; Wed, 08 Jun 88 09:37:12 EDT Date: Wed, 8 Jun 88 15:23:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: Notation To: Frank da Cruz Dear list subscribers It seems convenient to have a compact representation of the character content of tables under discussion, enabling an immediate view on their differences. Thus I extended a notation system, such as previously sent, for all characters in ISO8859-1, -2 and -9, (the last one includes Turkish, and is still to be approved). The code used for vertical line is 4F, that for exclamation sign 5A. Yours faithfully, Johan van Wingen A NOTATION SYSTEM FOR LETTERS NOT IN ASCII OR 94-EBCDIC The notation consists of two characters, the first being one of a restricted set of special characters, the second being one out of the common subset of ASCII and 94-EBCDIC. Thus it is suitable for processing by a program to a single byte code. Reference is made to the character identifications (ID) found in ISO6937, (these consist of two letters and two digits). First characters: (descriptions taken from ISO 6937-2, additions between parentheses, numbering system for identification taken from ISO6937-1, p. 
7) / acute accent 11,12 \ grave accent 13,14 ^ circumflex accent 15,17 % diaeresis (umlaut, trema) 17,18 ~ tilde 19,20 * caron (hachek) 21,22 # breve (Rumanian a) 23,24 # double acute accent (Hungarian o,u) 25,26 @ ring (above: a,u) 27,28 @ dot (above: z) 29,30 = macron (upper line) 31,32 $ cedilla (c,s,t) 41,42 $ ogonek (Polish a,e) 43,44 $ (barred: o, eth, thorn) 61.... _ (underline, fraction) & (ligature: ae,oe,sz) 51,52 ? (dot below) REGULAR LETTERS AND DECIMAL DIGITS not. ID Name or description a LA01 small a A LA02 capital A : : : z LZ01 small z Z LZ02 capital z 1 ND01 digit one : : : 9 ND09 digit nine 0 ND10 digit zero VOWELS not. ID Name or description /a LA11 small a with acute accent \a LA13 small a with grave accent ^a LA15 small a with circumflex accent %a LA17 small a with diaeresis or umlaut mark ~a LA19 small a with tilde #a LA23 small a with breve @a LA27 small a with ring =a LA31 small a with macron $a LA43 small a with ogonek &a LA51 small ae diphtong /e LE11 small e with acute accent \e LE13 small e with grave accent ^e LE15 small e with circumflex accent %e LE17 small e with diaeresis or umlaut mark *e LE21 small e with caron @e LE29 small e with dot above =e LE31 small e with macron $e LE43 small e with ogonek /i LI11 small i with acute accent \i LI13 small i with grave accent ^i LI15 small i with circumflex accent %i LI17 small i with diaeresis ~i LI19 small i with tilde =i LI31 small i with macron $i LI43 small i with ogonek &i LI51 small ij ligature @i LI61 small i without dot /o LO11 small o with acute accent \o LO13 small o with grave accent ^o LO15 small o with circumflex accent %o LO17 small o with diaeresis or umlaut mark ~o LO19 small o with tilde #o LO25 small o with double acute accent =o LO31 small o with macron &o LO51 small oe ligature $o LO51 small o with slash /u LU11 small u with acute accent \u LU13 small u with grave accent ^u LU15 small u with circumflex accent %u LU17 small u with diaeresis or umlaut mark ~u LU19 small u 
with tilde #u LU25 small u with double acute accent @u LU27 small u with ring =u LU31 small u with macron $u LU43 small u with ogonek /y LY11 small y with acute accent \y LY13 small y with grave accent ^y LY15 small y with circumflex accent %y LY17 small y with diaeresis or umlaut mark CONSONANTS (ISO8859-1, -2 and -9 only) not. ID Name or description /c LC11 small c with acute accent *c LC21 small c with caron $c LC41 small c with cedilla *d LD21 small d with caron =d LD61 small d with stroke $d LD63 small eth, Icelandic #g LG23 small g with breve /l LL11 small l with acute accent *l LL21 small n with caron $l LL61 small l with stroke /n LN11 small n with acute accent ~n LN19 small n with tilde *n LN21 small l with caron /r LR11 small r with acute accent *r LR21 small r with caron /s LS11 small s with acute accent *s LS21 small s with caron $s LS41 small s with cedilla &s LS61 small sharp s, German $p LT17 small thorn, Icelandic *t LT21 small t with caron $t LT41 small t with cedilla /z LZ11 small z with acute accent *z LZ21 small z with caron @z LZ29 small z with dot above Capital letters have even numbers, odd + 1. But notice the following: i LI01 small i @i LI61 small i without dot I LI02 capital I (without dot) @I LI30 capital I with dot above $D LD62 capital D with stroke, Icelandic eth DIGITS AND NUMBERS not. ID Name or description @1 NS01 superscript one @2 NS02 superscript two @3 NS03 superscript three _2 NF01 fraction one-half _3 NF04 fraction one-quarter _4 NF05 fraction three-quarters SPECIAL CHARACTERS not. ID Name or description =f SC01 general currency sign =L SC02 pound sign $ SC03 dollar sign =c SC04 cent sign =Y SC05 yen ! SP02 exclamation mark *! SP03 inverted exclamation mark " SP04 quotation mark ' SP05 apostrophe ( SP06 left parenthesis ) SP07 right parenthesis , SP08 comma _ SP09 low line - SP10 hyphen or minus sign . SP11 full stop, period / SP12 solidus : SP13 colon ; SP14 semicolon ? SP15 question mark *? 
SP16 inverted question mark *< SP17 angle quotation mark left *> SP18 angle quotation mark right + SA01 plus sign _+ SA02 plus or minus sign < SA03 less-than sign = SA04 equals sign > SA05 greater-than sign _: SA06 divide sign _* SA07 multiply sign # SM01 number sign % SM02 percent sign & SM03 ampersand * SM04 asterisk @ SM05 commercial at *( SM06 left square bracket \ SM07 reverse solidus *) SM08 right square bracket { SM11 left curly bracket | SM13 vertical line } SM14 right curly bracket #m SM17 micro sign @0 SM19 degree sign _o SM20 ordinal indicator masculine _a SM21 ordinal indicator feminine #S SM24 section sign #p SM25 pilcrow #. SM26 middle dot #c SM52 copyright sign #r SM53 registered sign *| : SM65 broken bar ^ SM66 not sign @/ SD11 acute accent @\ ` SD13 grave accent @^ SD15 circumflex accent @% SD17 diaeresis or umlaut mark @$ ~ SD19 tilde @* SD21 caron @# SD23 breve @" SD25 double acute accent @0 SD27 ring @@ SD29 dot above @= SD31 macron _) SD41 cedilla _( SD42 ogonek NOTE: If necessary, the following characters will be denoted as: SP space NB no-break space SH soft hyphen ISO8859-1 ISO8859-2 . 2. 3. 4. 5. 6. 7. A. B. C. D. E. F. . 2. 3. 4. 5. 6. 7. A. B. C. D. E. F. . .0 0 @ P ` p NB @0 \A $D \a $d . 0 @ P ` p NB @0 /R $D /r =d . .1 ! 1 A Q a q *! _+ /A ~N /a ~n . ! 1 A Q a q $A $a /A /N /a /n . .2 " 2 B R b r =c @2 ^A \O ^a \o . " 2 B R b r @# _( ^A *N ^a *n . .3 # 3 C S c s =L @3 ~A /O ~a /o . # 3 C S c s $L $l #A /O #a /o . .4 $ 4 D T d t =f @/ %A ^O %a ^o . $ 4 D T d t =f @/ %A ^O %a ^o . .5 % 5 E U e u =Y #m @A ~O @a ~o . % 5 E U e u *L *l /L #O /l #o . .6 & 6 F V f v *| #p &A %O &a %o . & 6 F V f v /S /s /C %O /c %o . .7 ' 7 G W g w #S #. $C _* $c _: . ' 7 G W g w #S @* $C _* $c _: . .8 ( 8 H X h x @% _) \E $O \e $o . ( 8 H X h x @% _) *C *R *c *r . .9 ) 9 I Y i y #c @1 /E \U /e \u . ) 9 I Y i y *S *s /E @U /e @u . .A * : J Z j z _a _o ^E /U ^e /u . * : J Z j z $S $s $E /U $e /u . .B + ; K *( k { *< *> %E ^U %e ^u . 
+ ; K *( k { *T *t %E #U %e #u . .C , < L \ l | ^ _4 \I %U \i %u . , < L \ l | /Z /z *E %U *e %u . .D - = M *) m } SH _2 /I /Y /i /y . - = M *) m } SH @" /I /Y /i /y . .E . > N @^ n ~ #r _3 ^I $P ^i $p . . > N @^ n ~ *Z *z ^I $T ^i $t . .F / ? O _ o _ @= *? %I /s %i %y . / ? O _ o _ @Z @z *D /s *d @@ . CP037 . CP500 . . 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. . 4. 5. 6. 7. 8. 9. A. B. C. D. E. F. . .0 & - $o $O @0 #m ^ { } \ 0 . & - $o $O @0 #m =c { } \ 0 .1 NS /e / /E a j ~ =L A J _: 1 . NS /e / /E a j ~ =L A J _: 1 .2 ^a ^e ^A ^E b k s =Y B K S 2 . ^a ^e ^A ^E b k s =Y B K S 2 .3 %a %e %A %E c l t #. C L T 3 . %a %e %A %E c l t #. C L T 3 .4 \a \e \A \E d m u #c D M U 4 . \a \e \A \E d m u #c D M U 4 .5 /a /i /A /I e n v #S E N V 5 . /a /i /A /I e n v #S E N V 5 .6 ~a ^i ~A ^I f o w #p F O W 6 . ~a ^i ~A ^I f o w #p F O W 6 .7 @a %i @A %I g p x _4 G P X 7 . @a %i @A %I g p x _4 G P X 7 .8 $c \i $C \I h q y _2 H Q Y 8 . $c \i $C \I h q y _2 H Q Y 8 .9 ~n &s ~N ` i r z _3 I R Z 9 . ~n &s ~N ` i r z _3 I R Z 9 .A =c ! *| : *< _a *! *( SH @1 @2 @3 . *( *) *| : *< _a *! ^ SH @1 @2 @3 .B . $ , # *> _o *? *) ^o ^u ^O ^U . . $ , # *> _o *? | ^o ^u ^O ^U .C < * % @ $d &a $D @= %o %u %O %U . < * % @ $d &a $D @= %o %u %O %U .D ( ) _ ' /y _) /Y @% \o \u \O \U . ( ) _ ' /y _) /Y @% \o \u \O \U .E + ; > = $p &A $P @/ /o /u /O /U . + ; > = $p &A $P @/ /o /u /O /U .F | ^ ? " _+ =f #r _* ~o %y ~O . ! @^ ? " _+ =f #r _* ~o %y ~O . . FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands : 8-Jun-88 13:59:09-EDT,1972;000000000001 Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Wed 8 Jun 88 13:59:06-EDT Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Wed, 08 Jun 88 14:00:03 EDT Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 3836; Wed, 08 Jun 88 13:59:58 EDT Received: by BITNIC (Mailer X1.25) id 8680; Wed, 08 Jun 88 13:59:49 EDT Date: Wed, 8 Jun 88 13:37:58 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: ISO 8859-1, -2, -3, . . . -9 Standards and High Level Languages To: Frank da Cruz I am trying to understand two areas: 1. I have copies of ISO 8859-1, through -4. What languages are covered by -5, through -9? 2. Did ISO put any restrictions on High Level (Programming) Languages with respect to the 8859 series of codes? In particular I can think of two possibilities: a. Only characters in common to all ISO 8859 codes are valid in high level languages for operators, etc. From my look at 8859-1 through -4, this means that code points X'20' through X'7F' and multiplication (small x) and division symbols. b. Only characters in ISO 8859-1 are valid for high level languages. The IBM NOT symbol "^" (X'5F' for CP 37 and X'BA' for CP 500) is included in 8859-1 but not in -2, -3, -4. Using 2.b. implies that 8859-1 must be common to terminals in use in several countries where the primary code is -2 through -9. Using 2.a. means that the NOT symbol will be unavailable for programming languages. 
Thank you for your comments, Ed Hart 8-Jun-88 16:15:54-EDT,2586;000000000001 Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Wed 8 Jun 88 16:15:49-EDT Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Wed, 08 Jun 88 16:16:41 EDT Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4203; Wed, 08 Jun 88 16:16:40 EDT Received: by BITNIC (Mailer X1.25) id 2403; Wed, 08 Jun 88 16:15:44 EDT Date: Wed, 8 Jun 88 15:42:44 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: ISO 8859-1, -2, -3, . . . -9 Standards and High Level Languages X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz There has been some discussion about "requiring" the programming languages to make a common statement about codes, but nothing definitive has happened. The strongest statements have been about support for multi-octet character sets, which has nothing to do with ISO8859. There was a strong suggestion several years ago that the languages avoid the use of any character not in the ISO646 Basic Table (i.e., seven-bit graphics with all national use positions excluded), but it mostly resulted in a survey of deviants (almost everyone) and little action. The one noticeable consequence of that effort may well be the exclamation mark/vertical bar confusion about which character to use for "or", and the similar tilde/diaeresis problem with "not". The different programming language standards differ in how they approach character sets. A few say what amounts to "you will use ASCII". At the other extreme, at least one says "use whatever external form you like, as long as there is an abstraction that maps it into...". Some of those approaches are more easily consistent with ISO8859 variations than others. 
There has been, as far as I know, no serious proposal for a common ISO8859 subset for programming languages. The common subset is an ISO646 subset. Finally, it is very difficult for the working groups in one subcommittee (e.g., character sets and codes) to require the working groups in another subcommittee (e.g., programming languages) to do (or not do) anything. That is just not how the process works. 12-Jun-88 08:24:22-EDT,1957;000000000001 Return-Path: <@CUVMA.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMA.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Sun 12 Jun 88 08:24:20-EDT Received: from CUVMA.CC.COLUMBIA.EDU(MAILER) by CUVMA.CC.COLUMBIA.EDU(SMTP) ; Sun, 12 Jun 88 08:23:36 EDT Received: from BITNIC.BITNET by CUVMA.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6094; Fri, 10 Jun 88 12:58:25 EDT Received: by BITNIC (Mailer X1.25) id 5835; Fri, 10 Jun 88 12:58:15 EDT Date: Fri, 10 Jun 88 16:50:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: a few notes To: Frank da Cruz Dear list subscribers A few little points this time. 1. People who cannot get CP037 and CP500 elsewhere may consult a new IBM publication (Jan. 1988) which just arrived here. It is SC33-0554-00 GDDM Type faces and Shading Patterns. 2. Please correct an error in the CP037 table recently mailed by me. At B0 "^" should be "@^". 3. Mr. Hart is quite right in concluding that the NOT sign only occurs in ISO8859-1. Apparently people in Poland or Hungary are not supposed by ISO/TC97/SC2 to use PL/I, which is not according to the facts. 4. As most programming language standards are much older than ISO8859 (1987), it cannot be expected that these take it into account. There is much more to say about the relation, but that must come at a later moment. 5. A list of all the relevant standards was mailed 24 March, and is contained in LOG8803. Yours faithfully, johan van Wingen FROM J. W. 
van Wingen MOSGLA@HLERUL2 : Mail to : P. O. Box 486, 2300AL Leiden, Netherlands : 15-Jul-88 11:40:15-EDT,3236;000000000001 Return-Path: <@CUVMB.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Fri 15 Jul 88 11:40:12-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 15 Jul 88 11:36:26 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4961; Fri, 15 Jul 88 11:36:22 EDT Received: by BITNIC (Mailer X1.25) id 7043; Fri, 15 Jul 88 11:36:33 EDT Date: Fri, 15 Jul 88 17:12:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: Code switching To: Frank da Cruz Dear list subscribers My congratulations to Mr. Kesich for his excellent analysis. I sent a copy of it and of Mr Hart's mailing to the SEAS Secretary (DECK@RCSCK11). I hope both of you will not mind. It is strange to see how IBM attitudes are with respect to ISO standards, for one can find IBM people at all key positions in ISO committees. Chairman of ISO/TC97, now ISO/IEC JTC1, is J. Rankine, from IBM. Convener of SC2/WG2 (multibyte characters) is J. Andersen, IBM. In SC2/WG3, (7-8 bit codes) you find Mr. W. F. Bohn, IBM. And this is a far from complete list. As for the problem of switching to another code page, I looked at CP870, which should be the equivalent of ISO 8859-2 (Eastern Europe). I was surprised that it is not identical with the code page you get when you convert ISO 8859-2 with the same translate table as for producing CP037 from ISO 8859-1. This has curious consequences. Suppose you use ISO 2022 for table switching in a 8859 file. Then if you have \a /r the codes for \a and /r are identical, but they produce a different graphic as a result of the shift, ("a" with grave accent is denoted by \a, and "r" with acute accent with /r). 
Now, if you translate this file to EBCDIC with the customary translate table, all codes are converted accordingly, equal codes remaining equal. But this does not produce a /r any more, when the shift is translated to cause switching from CP037 to CP870. This implies that an extra function is required at translating, to switch also the translate table at finding a shift code. This puts an additional burden to our poor hardware and software. A note for people who complained that ISO does not keep to its own rules when introducing 96-character sets. It appears that there is a Third edition of ISO 2022 (1986-05-01) which differs from the Second (1982-12-15) edition in this respect. I became aware of this only very recently. A few corrections should be made in the tables I sent. In that one headed ISO8859-1 position 7F should be blank, and DF should contain &s, not /s. The table for ISO8859-2 contains the same errors. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands : 15-Jul-88 13:37:31-EDT,3313;000000000001 Return-Path: <@CUVMB.CC.COLUMBIA.EDU:ISO8859@JHUVM.BITNET> Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Fri 15 Jul 88 13:37:29-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 15 Jul 88 13:33:45 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5256; Fri, 15 Jul 88 13:33:40 EDT Received: by BITNIC (Mailer X1.25) id 8100; Fri, 15 Jul 88 13:33:23 EDT Date: Fri, 15 Jul 88 10:57:21 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: IBM and standards (PS/2 code page, ISO8859-2 translation) To: Frank da Cruz Footnote to John Kesich's remarks about IBM and standards: it's not the programs I care about (any program that hard codes non-standard character set extensions is asking for everything it gets), but the users' data. You don't really think PC users should have to convert all of their data with non-ASCII characters when they move from PCs to PSs, do you? You don't really think they would, even if they were supposed to, do you? I don't want the phone calls I would get if IBM had moved the umlauts to their ISO 8859-1 positions. And you don't either. Yes, it would have been nice for the PC to have had a rational extension of ASCII instead of the mess it actually has. Yes, it would be nice if the PS/2 could switch code pages in flight. Yes, it would have been nice for IBM to adhere fully to ISO 8859-1. But given the original PC character set, given the hardware the PS/2 actually has, and given IBM's unwillingness to make users eat data conversion costs even for their own good, the PS/2 code page does look (to me) like a step forward. Let's rejoice: it's not often we see even small steps moving in the right direction. Footnote to J. W. 
van Wingen's remark about ISO 8859-2 translation: doesn't IBM's decision to avoid data conversion problems by retaining national versions of the extended EBCDIC code pages imply, already and by itself, the impossibility of using the same translate table for the various parts of ISO8859? I agree it's a shame, it is a rather large step in the wrong direction. But is it a surprise? At least part of the mapping must be determined by the existing EBCDICs for Greece, Israel, and so on. What firm wants to impose immense data conversion costs on whole countries of users, if they can avoid them by fouling up the EBCDIC/ASCII translation problem a little bit more? (I don't mean just IBM -- I've seen ASCII machines screw up the translation too.) It's depressing to think how long lists like this one are going to be necessary. Our poor hardware and software are going to continue to be strained. (By the way, thanks to JWvW for acknowledging that ISO 2022 had changed recently on the 94/96 character issue. I thought I was going crazy.) All the above pessimism is my own and not the official policy of my employer. Michael Sperberg-McQueen, University of Illinois at Chicago 31-Aug-88 2:52:45-GMT,3038;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA02759; Tue, 30 Aug 88 22:52:39 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Tue 30 Aug 88 22:51:54-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Tue, 30 Aug 88 22:51:17 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7711; Tue, 30 Aug 88 22:51:16 EDT Received: by BITNIC (Mailer X1.25) id 7025; Tue, 30 Aug 88 22:53:09 EDT Date: Tue, 30 Aug 88 19:49:00 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Richard Subject: Re: SHARE White Paper To: Frank da Cruz >>IBM was following an international standard to our benefit. 
If this were true no mapping would be needed. Do you receive money from these people? >>If you do not like the character set, your complaint is with ISO--not IBM ISO8859-1 fills the need for a multi-lingual Latin character set very well. Many people at many installations including this one will be happy with it. Perhaps most people in some W. European countries will be happy with it. I have no complaints with ISO. However, I get the impression that some members of this list believe this character set is suitable for *general* use in North America. Have I misunderstood? >>IBM finally gave us the potential to have a 1-to-1 mapping between >>CECPs 37 and 500 and an 8-bit "ASCII" ... It would be an excellent achievement if this effort produced an 8 bit ASCII-to-EBCDIC conversion that gained as widespread use as the most popular of the current 7 bit tables. Code page 500 IS *NOT* just as much EBCDIC as code page 37 is, because the former is unlikely to gain widespread use. The latter has some chance although there will be a game of musical brackets for the next decade. So what's a thousand wasted man-years? >>Special characters, like the ones needed for publishing, >>will require code page switching. Agree. What I would like to minimize is the number of translate tables visible to the user. Preferably just one for most users. >>If you look at Code Page 500 and the "Standard" ASCII-to-EBCDIC conversions >>you will discover that code page 500 maps very well into 7-bit ASCII. Does your installation use "Standard" ASCII-to-EBCDIC conversions to attach ASCII terminals to your EBCDIC mainframe? I can think of 3 printable things to say about this table: - Some people use it. - Most people do not use it. - It has produced F.U.D. Having 2 Code Pages for ISO8859-1 has produced F.U.D. Moving Brackets from where IBM once recommended has produced F.U.D. Moving Braces from where IBM once recommended has produced 15 years of F.U.D. 
With no obvious explanation for these changes, one begins to suspect that their only purpose is the production of F.U.D. 31-Aug-88 3:59:43-GMT,1785;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA06584; Tue, 30 Aug 88 23:59:40 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Tue 30 Aug 88 23:58:56-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Tue, 30 Aug 88 23:58:20 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7745; Tue, 30 Aug 88 23:58:19 EDT Received: by BITNIC (Mailer X1.25) id 7513; Wed, 31 Aug 88 00:00:15 EDT Date: Tue, 30 Aug 88 23:38:03 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Re: SHARE White Paper X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz For English and French-speaking North American purposes, since ISO8859-1 is a superset of ASCII (with all of the leading-zero-bit characters in the same positions as the ASCII seven-bit characters), and contains, as far as I know, an adequate set of non-ASCII characters (diacritical markings, etc) to represent French, there appears to be no reason why it should not be adopted for most general use when: - an eight bit set can be handled and processed and - an ANSI/ISO character set is to be used, rather than, e.g., EBCDIC. So, yes, it has been assumed that, in these contexts, 8859-1 is suitable for general USA and Canadian use. I lack the experience to know whether it would be adequate/suitable for general Mexican use, so can't make a general statement about North America. 
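The claim above, that ISO 8859-1 keeps every seven-bit ASCII character in its ASCII position, can be checked mechanically. What follows is only a minimal sketch, assuming a present-day Python interpreter whose `latin-1` codec is taken to stand in for ISO 8859-1. The function name and the French sample string are illustrative assumptions and come from none of the messages.

```python
def latin1_extends_ascii() -> bool:
    """Return True if bytes 0x00-0x7F decode identically as ASCII and as Latin-1."""
    seven_bit = bytes(range(128))
    return seven_bit.decode("ascii") == seven_bit.decode("latin-1")


# Illustrative sample only: a few accented French words.
sample = "déjà vu, garçon, Noël"

# encode() raises UnicodeEncodeError if a character has no Latin-1 code point.
encoded = sample.encode("latin-1")

# Each character of the sample occupies exactly one octet.
one_octet_each = len(encoded) == len(sample)

print(latin1_extends_ascii(), one_octet_each)
```

The check covers only the sample characters shown; it says nothing about whether the repertoire is adequate for every French text, which the message itself leaves open.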
31-Aug-88 17:00:15-GMT,5256;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA02848; Wed, 31 Aug 88 12:59:57 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Wed 31 Aug 88 12:59:11-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 31 Aug 88 12:58:34 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8301; Wed, 31 Aug 88 12:58:32 EDT Received: by BITNIC (Mailer X1.25) id 4246; Wed, 31 Aug 88 12:59:51 EDT Date: Wed, 31 Aug 88 09:40:27 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: SHARE White Paper To: Frank da Cruz In-Reply-To: Your message of Tue, 30 Aug 88 19:49:00 CDT I am NOT getting paid by IBM to work in this area. They could not afford it. In fact, my manager and wife and children have many ideas about how I could better use the time I am spending on this effort. 90% of my week at SHARE was devoted to these issues rather than attending other sessions which would have benefited my installation more (in the short term). I am working in this area because I too am angry and frustrated. We have tried to fix this problem for the last 15 years and have been unsuccessful. Do I like IBM's failure to make a decision on one EBCDIC for Latin Alphabet Number 1? NO!| (If you are using CP 500 you would see "|!".) In fact, IBM has taken an internal IBM argument and made it into an international argument by encouraging code page 500 adoption in Belgium and Switzerland. My words for that are unprintable. Such actions are irresponsible and show no regard for customers. Code page 500 is being widely used here and abroad--especially in Belgium and Switzerland. One of the members of the SHARE committee was using it in the U.S. on 3274s just to avoid the ASCII-EBCDIC translate issues. 
Messages from Belgian installations on EARN have been coded in code page 500. A recent message to me stated that code page 37 was not up for consideration and that IBM documentation (and he quoted it) said to use code page 500 for international operations. I believe he was from Germany. Just weeks ago, I talked to an IBM contact in the Corporate standards area. He said that IBM as a Corporation (meaning all of the IBM development Divisions) has NOT decided on one EBCDIC code page for ISO Latin Alphabet Number 1. Concerning changing the code points for brackets from the TN/T11 print train, if we have a requirement to move them back, we can certainly tell IBM that. (We have, in fact, drafted such a requirement.) Part of the problem is that IBM is too big and that too many people within IBM have no idea of the problems. The other part is that to correct the problems requires changes to too many IBM products, and requires customers to convert a lot of data and programs. IBM understands big customers. When US multinational banks, insurance companies, manufacturers, oil companies start telling IBM to fix the problems, IBM might put MORE resources into the effort than they have now. The intent of the SHARE paper is to have it approved as a SHAREwide position paper. In other words, to get approval from the 30 SHARE managers who come from not only the Universities but also Commercial accounts. As a part time effort, it will be 9 to 12 months before we can obtain SHAREwide status. If you want to contribute to the effort you are welcome. But if you are angry, take it out on IBM--not me. I need some help with the discussions at the SHARE meetings. I need feedback on the content of the paper. Is the paper clear? Does it read smoothly? Does it really describe all of the problems? Where are the mistakes in the paper? Solutions to the problems are going to cost IBM millions (billions?) of dollars, what business justification do we have to convince IBM to spend that kind of money? 
If you were faced with solving all of the problems, it would be easier to stick your head in the sand than to try to find a solution. It will take an IBM Corporate commitment to solve the problems. One IBM Division cannot do it alone. Right now, the IBM Corporation has a tremendous dollar commitment to Systems Applications Architecture. SAA is the right target for this effort. We have a window of opportunity. If the paper could have been handed over to IBM as a SHAREwide position paper in August, we could have had an earlier and harder impact. We have IBM people who work with the Committee. They are committed to working for solutions. Material from SHARE European Association (SEAS) and the SHARE white paper effort have already been used to influence IBM decision makers to commit resources to solve the problems. If you come to the SHARE meeting, use some restraint when talking to the IBM people who are listening and trying to help us. They are people just like you and me. They will get very defensive if you start screaming at them. We have a good working relationship and I do not want to ruin it this far into the effort. Ed Hart 2-Sep-88 12:54:15-GMT,2369;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA28559; Fri, 2 Sep 88 08:54:12 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Fri 2 Sep 88 08:53:26-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 02 Sep 88 08:47:33 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0170; Fri, 02 Sep 88 08:47:31 EDT Received: by BITNIC (Mailer X1.25) id 7006; Fri, 02 Sep 88 08:45:44 EDT Date: Thu, 1 Sep 88 16:49:00 MET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: SC2 meeting To: Frank da Cruz Dear list subscribers The paper by Mr. 
Hart is a very valuable contribution to the character code discussion with IBM. But we should not forget that also ISO codes have their imperfections. ISO standards are not conceived in an ivory tower, and development can be influenced. The responsible subcommittee ISO/IEC JTC1/SC2 will meet in the week of 17 October in London. Attendance is restricted to national delegates. You can be one, if you are appointed by your national standards institute (the money for the trip you have to provide yourself, mostly). If there is a national SC2 for your country, they will nominate. But, often they are in want of people knowing the matter, who are also prepared to do some work. So, contact your NSI, ask for the names of the people charged with character codes and tell them your interests. If you are successful we'll see each other in London, if not, explain to me your ideas, perhaps I can do something with it. (For US citizens, ANSI rules are different, you have to be backed by some organization.) Should you not know where your NSI is, contact me. I have just completed the first draft of a paper for SC22 and SC2, on Coded Character Sets and Programming Languages. Its 800 lines will be sent to Mr. Hart, to be ordered on request from JHUVM, for your comment. Yours faithfully, Johan van Wingen FROM J. W. van Wingen MOSGLA@HLERUL2 : Mail to : P. O. 
Box 486, 2300AL Leiden, Netherlands : 9-Sep-88 14:19:35-GMT,3809;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA26215; Fri, 9 Sep 88 10:19:19 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Fri 9 Sep 88 10:28:00-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 09 Sep 88 10:26:38 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5539; Fri, 09 Sep 88 10:26:37 EDT Received: by BITNIC (Mailer X1.25) id 4428; Fri, 09 Sep 88 10:28:30 EDT Date: Fri, 9 Sep 88 12:12:11 +0200 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre PIRARD Subject: Re: CP 37 vs CP 500 vs ? To: Frank da Cruz In-Reply-To: Message of Wed, 7 Sep 88 17:18:52 EST from >What is BITNET's position on all this - have they chosen an official BITNET transfers files without conversion and ignores codes issues apparently, being mostly EBCDIC-EBCDIC or restricted to understanding notes. But the problem *had* to be solved when the data goes through an ASCII-EBCDIC gateway. These, to the best of my knowledge, converged to translating ASCII to "that" EBCDIC which is Edwin's proposition: CECP 037 with brackets at AD,BD. But they also convert ASCII circumflex 5E to EBCDIC 5F which is the not sign in 037. Much EBCDIC data is consequently stored on BITNET servers in this code. These gateways have thus established a de-facto standard. Pressing them to change their translation to respect that of the graphics "circumflex" and "not sign" would certainly cause acute problems. On the other hand, when (if?) a BITNET standard code will exist, it would be most welcome these gateways perform 8-bit translation of this code to ISO8859. The other side is 8-bit mostly, isn't it? And while we are at it, why not ask them to implement a "no conversion" feature that would be specified in the RFC header? 
Will other versions of ISO8859 and their corresponding EBCDICs (one each :-) ) be defined so that they use the same translation? To be specific, I'd like to make sure the modified CECP 037 is as follows: - 037 brackets are moved from BA to AD and BB to BD. - the displaced characters conversely move from AD to BA and BD to BB. - thus we have ISO-CECP 037' conversion 5B-AD 5D-BD DD-BA A8-BB. ? will "circumflex" and "not sign" still be 5E-B0 AC-5F (1) or as per the gateways 5E-5F and consequently AC-B0 ? (2) I could not find a better solution than to implement (1) for terminal mode and (2) for file transfer. (2) for terminal mode would impair either ISO8859 or CECP037 and worse, ASCII or EBCDIC. Right? Any comment? > 1) code page translation can be implemented automatically and > transparently when data traverses certain RSCS links IBM says so, but it is not that easy. Translating requires both source and destination codes to be known, binary being a special case where the table is null translation. If a receiver cares to indicate its codepage, the only way to know the source one is to have the sender tag the file accordingly. Many can be expected not to care for that. A by-site code tag to be added to the BITNET tables could be imagined, but one still has to know if the file is binary or text and really that installation's code. No, I think this would add to the problem of knowing what was the source code that of also knowing what translation RSCS did use and keep us checking files codes forever. And the tagging of files can take longer to install than using a common code, which is a better long term solution towards everyone's promised peace of mind. André. 
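The code-point pairs listed in the message above can be written down as a partial translate table, which makes the difference between choices (1) and (2) easy to inspect. This is a minimal sketch only: the hexadecimal pairs are copied from the message, while the names `BRACKET_PAIRS`, `OPTION_1`, `OPTION_2`, `build_partial_table`, `is_one_to_one` and `invert` are hypothetical and exist in no gateway or IBM product. The table covers just the six code points under discussion, not a full 256-entry conversion.

```python
from typing import Dict

# Pairs are (ISO 8859-1 code point -> modified CECP 037 code point),
# copied from the message: 5B-AD 5D-BD DD-BA A8-BB.
BRACKET_PAIRS: Dict[int, int] = {0x5B: 0xAD, 0x5D: 0xBD, 0xDD: 0xBA, 0xA8: 0xBB}

# (1) circumflex and not sign keep their CECP 037 graphics: 5E-B0 AC-5F.
OPTION_1: Dict[int, int] = {0x5E: 0xB0, 0xAC: 0x5F}

# (2) circumflex goes to 5F as the gateways do, so not sign goes to B0.
OPTION_2: Dict[int, int] = {0x5E: 0x5F, 0xAC: 0xB0}


def build_partial_table(option: int) -> Dict[int, int]:
    """Return the six-entry ISO -> EBCDIC table for option 1 or 2."""
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    table = dict(BRACKET_PAIRS)
    table.update(OPTION_1 if option == 1 else OPTION_2)
    return table


def is_one_to_one(table: Dict[int, int]) -> bool:
    """A table is reversible only if no two sources share a target."""
    return len(set(table.values())) == len(table)


def invert(table: Dict[int, int]) -> Dict[int, int]:
    """Build the EBCDIC -> ISO direction from a one-to-one table."""
    return {ebcdic: iso for iso, ebcdic in table.items()}


for option in (1, 2):
    table = build_partial_table(option)
    pairs = " ".join(f"{iso:02X}-{ebc:02X}" for iso, ebc in sorted(table.items()))
    print(f"option ({option}): {pairs}  reversible={is_one_to_one(table)}")
```

Both options stay one-to-one because they only exchange the targets 5F and B0 between the same two source code points, so either table can be inverted for the return path.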
9-Sep-88 19:44:39-GMT,1699;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA18632; Fri, 9 Sep 88 15:44:28 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Fri 9 Sep 88 15:53:27-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 09 Sep 88 15:52:01 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6067; Fri, 09 Sep 88 15:52:00 EDT Received: by BITNIC (Mailer X1.25) id 1166; Fri, 09 Sep 88 15:52:02 EDT Date: Fri, 9 Sep 88 15:14:26 EDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Roger Fajman Subject: Re: CP37 code assignments To: Frank da Cruz Getting any given program to accept multiple code points for brackets or whatever characters is only half the battle. It also has to be able to produce output that will be readable on the devices you are using. How would you like to read a listing of a C program in which all the brackets and braces printed as blanks? As for BITNET, it seems to me that the network should pick a single standard code page for EBCDIC text files and nodes should be encouraged to translate whatever local code page they are using to that standard as the file is transmitted. The receiver can then translate it to whatever they use. I don't think that intermediate nodes should be performing translations. At this point, however, it seems best to see what IBM does before making a decision about which code page to use. 
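The scheme proposed above (the sender translates its local code page to one network standard, the receiver translates from that standard to its own, and intermediate nodes leave the data alone) can be sketched as follows. This is an illustration under loud assumptions: it uses the `cp037` and `cp500` codecs shipped with present-day Python, which may not match the 1988 "version 1" tables discussed on this list, and the choice of code page 37 as the network standard is arbitrary, since the message explicitly leaves that decision open. The function names are hypothetical.

```python
# Assumption for illustration only: the network standard is code page 37.
NETWORK_CODEC = "cp037"


def to_network(data: bytes, local_codec: str) -> bytes:
    """Sender side: local EBCDIC code page -> network standard."""
    return data.decode(local_codec).encode(NETWORK_CODEC)


def from_network(data: bytes, local_codec: str) -> bytes:
    """Receiver side: network standard -> local EBCDIC code page."""
    return data.decode(NETWORK_CODEC).encode(local_codec)


# A line of C-like text, the kind whose brackets and braces get mangled.
text = "a[i] = {x};"

sender_bytes = text.encode("cp500")          # file as stored at a CP 500 site
wire = to_network(sender_bytes, "cp500")     # translated once, on sending
received = from_network(wire, "cp500")       # translated once, on receipt

print(sender_bytes.hex())
print(wire.hex())
print(received == sender_bytes)
```

Because each end translates exactly once against a known standard, no node has to guess the source code page, which is the difficulty raised earlier about translation on RSCS links.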
10-Sep-88 8:29:24-GMT,4925;000000000001 Received: from CU20B.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA22026; Sat, 10 Sep 88 04:29:20 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CU20B.CC.COLUMBIA.EDU with TCP; Sat 10 Sep 88 04:29:12-EDT Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Sat, 10 Sep 88 04:27:49 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6437; Sat, 10 Sep 88 04:27:47 EDT Received: by BITNIC (Mailer X1.25) id 6492; Sat, 10 Sep 88 04:29:38 EDT Date: Sat, 10 Sep 88 03:20:00 CDT Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Richard Subject: Re: CP 37 vs CP 500 vs ? To: Frank da Cruz > - thus we have ISO-CECP 037' conversion 5B-AD 5D-BD DD-BA A8-BB. > ? will "circumflex" and "not sign" still be 5E-B0 AC-5F (1) > or as per the gateways 5E-5F and consequently AC-B0 ? (2) > > I could not find a better solution than to implement (1) for terminal mode > and (2) for file transfer. (2) for terminal mode would impair either ISO8859 > or CECP037 and worse, ASCII or EBCDIC. A better solution is to use (2) for both terminal mode and for file transfer. This has the effect of interchanging "not" and "circumflex" in CP 37 so that it would agree with the current de facto standard. This standard is widely used in software that supports ASCII terminals such as Waterloo Script, Kermit, and SAS. We rarely have to make mods to software. Following existing standards is the easiest way to get a "Standard" accepted, since it already has been accepted. No changes to gateways or languages such as PLI would be needed. Existing EBCDIC devices would display a "not sign" instead of a "circumflex" unless they were changed. I believe that it is the nature of standards to evolve unless a lot of work is done to invent new standards that do not follow the emerging one. Such work can cause many years of problems. 
A good example is the original "bit mapped" ASCII keyboard which took decades to finally expire. IBM's decision to ignore both their own, and SHARE's recommendations for the position of "braces" is another example. In this case it was the original standard that died. The point is that it takes at least a decade of "evolution" to resolve the confusion so caused. The reason for the current relatively stable ASCII/EBCDIC standard is that many years have elapsed since an existing standard was ignored. I am not counting the new "Standard" in the VS Fortran manual since it is too silly to be a threat. However CP 37 as it exists is very much a threat. The lower half is only wrong in 3 places. This table could very well become the next standard but only after another decade of confusion. This confusion would not be "to our benefit." However a modified CP 37 Version 2 as suggested above would become a standard almost overnight. One is tempted to wonder why, each time a standard evolves, a monkey wrench is then thrown? I don't believe the excuses that have been suggested - IBM is not aware of the problem or is too big or cannot afford to follow standards. Following standards is much easier than ignoring them. Who cares where the brackets and braces and circumflex, tilde, vertical, etc. are, as long as they don't dance around. This battle of the brackets may be a minor skirmish in the war between ASCII and EBCDIC. The eventual winner of this war will be the one supported by the most software, much like the VCR war between BETA and VHS. It has been suggested that it is easy to modify software. This is true for software that has no explicit support for ASCII terminals. I once installed a version of APL from Yale that had more translate tables than I could count. Some in TCAM. Some in APL itself. Some in "auxiliary processors". Data often went through more than one table serially, so that changes to one table required changes to others. 
Unless there is a standard translate it is a mistake to use ASCII terminals on corporate mainframes. Dedicated ASCII terminals seem to be going the way of dedicated word processors - to be replaced with micros. The latter are able to emulate either ASCII or EBCDIC terminals. Ten more years of confusion will discourage developers from including ASCII support in their software. This current effort by SHARE to create a new "Standard" is likely to delay the evolution of a standard since it is seeking support from an organization that doesn't want a standard to evolve. Neither SHARE nor BITNET could impose a standard without the help of a rich organization. How about the US military? If they can make COBOL a standard, then this should be trivial! Sorry for the overall negative tone. I hope a few positive notes crept through. Date: Fri, 4 Nov 88 20:23:43 GMT Sender: ASCII/EBCDIC character set related issues From: "Matthias Melcher +49 6221 5645-23,-01" <$28@DHDURZ1> Subject: Code Page 037 vs. 500 To: Frank da Cruz Today we received IBM's answer to our formal enquiry about their recommendation concerning the choice between code page 500 and 037: "The statement of Mr. Hart is correct that IBM has not decided on a single code page with the character repertoire for the international standard ISO 8859-1. This would not be within the meaning of the CECP (Country Extended Code Page) concept either, which is supposed to enable the user to undisturbedly migrate from the current to the extended character repertoire of the respective EBCDIC version in use. This does not alter the fact, however, that IBM declared the table 500 the "strategic" code page and - as correctly quoted by Mr. Melcher from the CECP announcement - recommend it for international applications. The decision which table to use is always up to the user himself. 
Regarding the internationality, however, of the network under discussion I would consider it false and short-sighted to prefer a national version of EBCDIC (even if it is the American one) to the international version. This is especially true because also non-EBCDIC oriented devices and systems will be connected to the network (7- and 8-bit ASCII). A one-to-one correspondence of the characters of 7-bit ASCII to the restricted EBCDIC, for instance, is given only when using table 500." (Wilhelm Friedrich Bohn, National Requirements and Standards, IBM Headquarter, Stuttgart, Germany) 11-Nov-88 22:13:58-GMT,1851;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA11215; Fri, 11 Nov 88 17:13:41 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 11 Nov 88 17:14:28 EDT Received: from PSUVM.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5046; Fri, 11 Nov 88 17:14:26 EDT Received: by PSUVM (Mailer X1.25) id 5545; Fri, 11 Nov 88 17:07:53 EST Date: Fri, 11 Nov 88 13:23:35 CST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Rick Troth Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Message of Fri, 4 Nov 88 20:23:43 GMT from <$28@DHDURZ1> That code page 500 is "strategic" conflicts with the CECP objective of "undisturbed" migration because: all of the gateways on this international network (that translate EBCDIC to ASCII and vice versa) conform to an EBCDIC code page incompatible with CP 500. How can you "undisturbedly" migrate when everything current is written for a not-compatible code set? But obviously Code Page 037 is not satisfactory either, although it does conform to most existing compilers (except C, TeX, SAS, etc). Ed, what do we call a modified CP 037? "Network EBCDIC" ? "Code Page 037-M" ? 
And if we just swap the brackets from CP 37 to their Kermit/WiscNet/7171 points, what do we do about the circumflex -vs- logical not problem? "Truth is truth" - Rick Troth Louis Gossett, Jr. TAMCBA VM Operations "Enemy Mine" Texas A&M College of Business 12-Nov-88 3:19:46-GMT,2265;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA05421; Fri, 11 Nov 88 22:19:43 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Fri, 11 Nov 88 22:20:33 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5301; Fri, 11 Nov 88 22:20:31 EDT Received: by BITNIC (Mailer X1.25) id 1093; Fri, 11 Nov 88 22:19:50 EST Date: Fri, 11 Nov 88 21:39:41 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Your message of Fri, 11 Nov 88 13:23:35 CST In the SHARE paper we are asking for one "Reference EBCDIC", it may be code page 500 v1 or code page 37 v1 (or v2 with the brackets at X'AD' and X'BD'). I do not know. I get conflicting requirements from discussions. 1. Some say give me one standard, I don't care what it is as long as you support it. (I would include compiler support here.) I will convert once, but don't ever ask me to do it again. 2. Others say give me one standard but make it code page 37 with the brackets in the TN/T11 code points. 3. Many Europeans have already converted to code page 500, they do not want to convert to code page 37. I can't blame them. However, they are already 99% there with code page 500. (characters at 7 code points are shuffled around between CP 37 and CP 500). For the logical not problem, I see requirements for: 1. a utility to help you convert data and programs from many different code pages into the new "Reference EBCDIC and ASCII" code pages. 
I also want it to be able to map both the EBCDIC not (¬) and circumflex (^) into the logical NOT. I also want both vertical bar (|) and split vertical bar (¦) to map into the logical OR. Some also want the tilde to map into the logical NOT. 2. compilers to recognize (possibly as a user invoked option) these kinds of relationships What do you think? Ed Hart 17-Nov-88 1:15:26-GMT,3794;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA23325; Wed, 16 Nov 88 20:15:09 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 16 Nov 88 20:15:13 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1534; Wed, 16 Nov 88 20:15:10 EDT Received: by BITNIC (Mailer X1.25) id 3262; Wed, 16 Nov 88 20:14:30 EST Date: Wed, 16 Nov 88 19:35:58 +0100 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre' PIRARD Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Message of Fri, 4 Nov 88 20:23:43 GMT from <$28@DHDURZ1> >Today we received IBM's answer to our formal enquiry about their >recommendation concerning the choice between code page 500 and 037: > [text deleted] >enable the user to undisturbedly migrate from the current to the >extended character repertoire of the respective EBCDIC version in use. > >This does not alter the fact, however, that IBM declared the table 500 >the "strategic" code page and - as correctly quoted by Mr. Melcher >from the CECP announcement - recommend it for international >applications. > >The decision which table to use is always up to the user himself. Before CECP's, we (Belgian) decided on an ASCII/EBCDIC conversion table when we started to use communication software. This was in fact removing a historical mod and adopting the standard VM tables. From then on, every software in sight, the 7171's, and joining BITNET were telling us we had made the right move. 
Having two kinds of 3270 terminals and some funny printers was considered as the inevitable result of the chaos of the computing world. Without our accented letters, we were used to restrictions and didn't care much as long as a capital A was a capital A. That the hardware finally supporting these accented letters was of that strange kind too was discovered by chance and it came as a shock when I did. Only later could we raise a discussion with IBM and hear the same tune as the one quoted above and of the existence of 037. Only later did I learn from BITNET that our code is called a ghost name 037 v2 and that many people love ghosts. What undisturbed migration means to us is that I had to start CECP 500 support on our 7171's and file transfer in addition to 037 v2. That it took a long time, catalysed a serious bug in the PC (KEYB losing interrupts) and an annoying feature of the 7171 in APL mode (refreshing the end of line on overstrike) and raises embarrassing questions from our users. >Regarding the internationality, however, of the network under discussion >I would consider it false and short-sighted to prefer a national version >of EBCDIC (even if it is the American one) to the international version. I *am* short-sighted, but still can tell an exclamation mark instead of a vertical bar in a VM help screen. Why does such a strategic code lack a decent font on our 3812? Is all that software really going to be converted or is GDDM to be a CECP 500 to 037 translator for European use only? >This is especially true because also non-EBCDIC oriented devices and >systems will be connected to the network (7- and 8-bit ASCII). A >one-to-one correspondence of the characters of 7-bit ASCII to the >restricted EBCDIC, for instance, is given only when using table 500." I can translate any PC code page to any CECP for the simple reason that IBM has defined translation between all these tables and ISO8859 and claims it's the rule of the game of future communication. 
And I applaud this point. André. 17-Nov-88 1:24:46-GMT,5138;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA24689; Wed, 16 Nov 88 20:24:42 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 16 Nov 88 20:24:47 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1544; Wed, 16 Nov 88 20:24:45 EDT Received: by BITNIC (Mailer X1.25) id 3501; Wed, 16 Nov 88 20:24:11 EST Date: Wed, 16 Nov 88 17:38:08 +0100 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre' PIRARD Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Message of Mon, 14 Nov 88 09:18:18 EST from Two problems of codes are discussed on this list. The first is the meaning of computer encoded data and is an intricate one indeed. And I totally agree that software must be designed to be code independent. The second is that our networks are text based and do not transport data in an undisturbed way, just because of a lack of agreement in the field of the first problem. What should be a simple matter is hidden behind a puzzle, despite a simple evident solution. Either code A and B do not represent the same objects and there is no meaning attached to transcoding one to the other. Or they do and there is no theoretical reason more than one should exist. Practically however, at least ASCII and EBCDIC exist for 7-bits codes and a host of looking alike ones for 8-bit extensions. To cope with N+1 codes that can fully transcode one to the other (or whose similarity is such that an approximate transcoding can be defined and accepted), a common single reference code is needed and should be the favoured vehicle. Else, every user of one of these must cope with N translate tables (and find out which to use) instead of a single one. This amounts to a total of N! table pairs (some will write N| :-) instead of N. 
This is what I call each minding his own business. I would have reached more than 100000 otherwise. At least, there should be no ambiguity as to the code used on a given data path in terms of translation to the reference one, and, by extension, to any other, when, for any reason, a code other than the chosen reference one is used. This means the transcodings by gateways should be coherent and the data unchanged when it returns to a path using the same code. Given that, anyone will be able to peacefully send his C source file to anyone. Transporting data that cannot meaningfully transcode to the reference code would suggest that a so-called binary mode should be implemented, that is no translation across gateways, in fact the less ambiguous translation. But if we have agreed upon the first point, we know that user B will receive unmodified data from user A as long as both ends communication lines use the same code. If they don't and the data is meant to be usable on the receiving system, two codes exist to represent the data. These two codes should translate the same way as would data of the reference code. This is the second rule of the game. Of course, 8-bit codes will travel easily only on 8-bit paths, but I think 8-bit communication is one thing easily at hand. 8-bit codes are limited, but just as 8-bit I/O chips are still used with 32-bit processors I see no near future for wider. So, we must admit a single 8-bit code will not cover all needs. The various ISO8859 versions are the most obvious example and it makes me sorry to see that we are forced to repeat with 8-bit a cause of the present problem, which was tucking different codes on 7-bit. But this time, we are limited by hardware. So, unless code switching techniques are used (exactly what the extension to 8-bit is trying to avoid), the x of ISO8859-x will be a tag of the data. But the data lines will be independent of x. 
The problem of deciding which translate tables to use is complicated by the fact that it must be discussed in terms of practical existing codes and will favour one instead of the other. And this raises religion wars, apparently only in the EBCDIC world. Ironically enough because of a 7-bit problem only and a handful of code points. ISO8859 does not look controversial in the ASCII world and is what I take as the reference one. But it must be emphasised that all devices lose their ASCII label when considered with an 8-bit point of view. An IBM PC, Macintosh or whatever must translate its code to ISO8859 on the communication line, be it for text file transfer or terminal mode. This requires that a precise translate table be defined for the 256 code points. This has been done by IBM, but I have been unable to obtain that from Apple. I did not even try others. If anyone knows any, I am much interested. All this with the best of my limited knowledge of a code called English. But this is another problem that was forced upon us by the ages. Let us at least make simple what we can invent. André. 23-Nov-88 5:19:08-GMT,1710;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA00983; Wed, 23 Nov 88 00:19:04 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 23 Nov 88 00:08:06 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0162; Wed, 23 Nov 88 00:08:05 EDT Received: by BITNIC (Mailer X1.25) id 2703; Wed, 23 Nov 88 00:07:59 EST Date: Tue, 22 Nov 88 12:13:20 CST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: TCP/IP support for ISO8859? To: Frank da Cruz When I asked this question on the TCP list, no one answered, either because they thought it uninteresting, or because they didn't know the answer, or because they didn't understand the question. 
This list ought at least to understand the question. I am using IBM's VM and PC TCP/IP products to Telnet into our 3081 running VM/CMS as a virtual 3270, across an Ethernet. The connection is clearly an eight-bit connection, and the code points not assigned by 94-character EBCDIC are being mapped into the eight-bit extended ASCII of my PS/2. The question: does anyone know where this mapping is being done, or what is needed to customize it (or, I should say, to correct it) so that it maps correctly from the PS/2 modification of ISO8859-1 to one or the other of the extended EBCDIC code pages? Many thanks for any hints or tips. -Michael Sperberg-McQueen University of Illinois at Chicago 23-Nov-88 5:19:12-GMT,1918;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA00987; Wed, 23 Nov 88 00:19:07 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 23 Nov 88 00:13:13 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0170; Wed, 23 Nov 88 00:13:12 EDT Received: by BITNIC (Mailer X1.25) id 2958; Wed, 23 Nov 88 00:12:57 EST Date: Tue, 22 Nov 88 18:19:46 +0100 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre' PIRARD Subject: Translation of ASCII DEL To: Frank da Cruz A discussion with John Chandler raised a question. I modified his Kermit 370 tables as per the tables I got from IBM as the official ones and I once sent to this list to have Kermit translate 037 v2 to ISO8859 by default. Modifying IBM's 037 to be 037 v2 and applying them to Kermit's was only extending the Kermit tables. Except that the Ascii DEL was now translated to EBCDIC FF instead of the former 07 which is labeled as DEL in the EBCDIC chart. 
What happens is that, in addition to PC graphic symbols, IBM tucked two characters in the ISO 80-9F unassigned range: 9Fa=07e=Florin sign LI61 9Ea=0Ae=i dotless small SC07 Other control codes that Kermit used to translate to nulls now have a definition for a graphic in that range. I see something good in the IBM tables. They are defined for all the 256 code points and are revertible. I take it as IBM's strictest right to define a translation of its 32 additional Ctl characters to the 32 undefined ones of ISO8859 for that sake. Why they chose not to assign the florin sign to FF, I don't know. A Florin is a Gulden, isn't it Johan? Any idea? André. 23-Nov-88 5:19:06-GMT,1231;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA00979; Wed, 23 Nov 88 00:19:01 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 23 Nov 88 00:03:47 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0160; Wed, 23 Nov 88 00:03:45 EDT Received: by BITNIC (Mailer X1.25) id 2554; Wed, 23 Nov 88 00:03:49 EST Date: Tue, 22 Nov 88 15:38:09 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: TCP/IP support for ISO8859? To: Frank da Cruz In-Reply-To: Your message of Tue, 22 Nov 88 12:13:20 CST IBM VM TCP/IP connection to PC and translations. You might try using code page switching with DOS 3.3 or 4.1 to use code page 850 which contains all of the characters of ISO 8859-1. I am using it successfully with the IBM PC 3270 Emulation Program Version 3.03 to transmit files back and forth over a 3270 coax connection to Code Page 37, v1 on the mainframe. 
Ed Hart 29-Nov-88 2:41:21-GMT,2646;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA08871; Mon, 28 Nov 88 21:41:12 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Mon, 28 Nov 88 21:40:57 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5924; Mon, 28 Nov 88 21:40:55 EDT Received: by BITNIC (Mailer X1.25) id 4552; Mon, 28 Nov 88 21:36:54 EST Date: Wed, 23 Nov 88 12:43:09 CST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Rick Troth Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Message of Fri, 11 Nov 88 21:39:41 EST from Ed, you answered my questions concisely. Thank you. I have put off replying because you asked what I think and that means I will have to stop and do so. I suggested to the networking group that Texas A&M adopt Code Page 37 and was met with "hem ... haw" response. So I waited, hoping that something would develop from Share 71.5 or elsewhere. Then I read the statement from IBM in Europe (was that Germany?) supporting Code Page 500 for "international" use; that bothers me since I am partial to CP 037, (see below). We are a "traditional VM EBCDIC" site: we have a 7171, run Kermit quite a bit, have an RSCS connection to dozens of VAXen, etc. What I can now call CP 37 V2 will work VERY well for us. (Over and over again) because of the de facto translation in WISCNET and cousins, CP 37 V2 is an easy extension of "EBCDIC". As I am a C fan of late, the 7 points different between CP 037 and CP 500 is the biggest headache IBM has created lately (second to CP R5). In C, ! means "logical NOT" and | means "bitwise OR". Furthermore, != is the relation "does not equal" and |= is bitwise OR assignment. Brackets [] (I now use points AD and BD without fear!) are used to subscript arrays. 
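The effect of those 7 code points on C source can be shown directly. This is a small sketch, assuming Python's built-in `cp037` and `cp500` codecs match the unmodified Code Page 37 and Code Page 500; the sample line of C and the variable names are made up for the illustration.

```python
# List the code points where the two pages disagree.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp037") != bytes([b]).decode("cp500")
]
print([f"{b:02X}" for b in differing])

# A line of C stored under CP 037 and then read as if it were CP 500.
line = "if (a[i] != b) flags |= 1;"
stored = line.encode("cp037")
misread = stored.decode("cp500")
print(misread)

# Characters of the original line that come out as something else.
changed = sorted({c for c, m in zip(line, misread) if c != m})
print(changed)
```

Under these assumptions the brackets, the exclamation mark and the vertical bar all change, and the `|=` in the sample reads back as `!=`, which is still valid C with a different meaning.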
In a world without CP 500, multiple code points can be mapped to a single meaning without having to toggle a compiler option switch, as in AD, BA, and (from "the 3180 set") 41 all being mapped to an open bracket. But enter CP 500 and such "universal mapping" fails. "It is free; it is not cheap." Rick Troth - Chris Osborne TAMCBA VM Operations Texas A&M College of Business 29-Nov-88 14:37:38-GMT,1540;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA13335; Tue, 29 Nov 88 09:37:34 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Tue, 29 Nov 88 09:37:18 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6376; Tue, 29 Nov 88 09:37:16 EDT Received: by BITNIC (Mailer X1.25) id 5902; Tue, 29 Nov 88 08:56:37 EST Date: Tue, 29 Nov 88 08:39:11 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Code Page 037 vs. 500 To: Frank da Cruz In-Reply-To: Your message of Wed, 23 Nov 88 12:43:09 CST I am writing the requirements in the paper now. Basically we ask that IBM standardize on one CECP for Latin alphabet no. 1 as defined in ISO 8859-1. Although the wording will say SHARE prefers Code Page 37 over 500, we say that CP 500 is acceptable if the EBCDIC compilers are modified to use CP 500 code points. Furthermore, if IBM selects CP 37 as the base, we require that either the IBM PASCAL and C compilers (on mainframe and midrange) be changed to use the BA, BB brackets or that CP 37 v2 be defined with brackets in the AD, BD code points. In any case some kind of translation utility is required for migration, particularly in Europe. Wait for the exact wording. 
Ed Hart 15-Dec-88 13:36:32-GMT,1876;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA00874; Thu, 15 Dec 88 08:36:27 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Thu, 15 Dec 88 08:38:44 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6322; Thu, 15 Dec 88 08:38:42 EDT Received: by BITNIC (Mailer X1.25) id 8918; Thu, 15 Dec 88 08:38:54 EST Date: Thu, 15 Dec 88 14:31:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: sc2 To: Frank da Cruz Dear list subscribers I paused a while with contributing to the list. Thorough comments on the various proposals require more time than I had as yet, but some news from the ISO JTC1/SC2 meeting in London, 17-21 Oct. may interest you. I also attended WG2, Multiple octet coding. Of course nothing was said about EBCDIC, but Mr. W. F. Bohn was there, and at least two other people from IBM. A number of Resolutions was adopted, I'll give the text in my next contribution. Some comments I cannot leave to a later moment. I was quite perplexed when reading that CP037 has "versions". What does this mean? Also it was proposed to copy things from Postscript. It should be remembered that Adobe is taking an active part in ISO standardization and may change its code tables to those from ISO just the moment it likes. Besides that, ISO is now busy with developing the successor of Postscript called SPDL, Standard Page Description Language. Doing the actual work are people from Adobe, Xerox and IBM. FROM J. W. van Wingen MOSGLA@HLERUL2 Mail to P. O. 
Box 486, 2300AL Leiden, Netherlands 12-Jan-89 23:47:35-GMT,1997;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA20709; Thu, 12 Jan 89 18:47:27 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Thu, 12 Jan 89 18:46:25 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0810; Thu, 12 Jan 89 18:46:23 EDT Received: by BITNIC (Mailer X1.25) id 8384; Thu, 12 Jan 89 18:48:25 EST Date: Thu, 12 Jan 89 18:44:04 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Brian Eliot Subject: IBM 3174 code page/character set description To: Frank da Cruz A new manual I just received contains some interesting material on this topic. It is GA27-3831-0 "3174 Subsystem Control Unit Character Set Reference" I believe this replaces the earlier manual GA27-2837 "IBM 3270 Information Display System Character Set Reference" which applies to older 3270 control units. The items I noted were 1. 3270 national language support is described in terms of "code pages" and "character sets" rather than the earlier "I/O interface codes". Thus the description clearly distinguishes character sets, character generators (display hardware), and code pages. 2. There is a description of the values returned by a Query Reply (Character Sets) structured field. This query may be used to ask the terminal what character set/code page combinations it supports. Only a few terminals support the CGCSGID field. 3. In conjunction with the manual GA23-0214-3 "3174 Subsystem Control Unit Customizing Guide" you can find out about Country Extended Code Page (CECP) support. 4. A mapping is implicitly defined for certain control codes between EBCDIC and ASCII-8 (a.k.a. ISO 8859). 
12-Jan-89 22:16:53-GMT,1561;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA16573; Thu, 12 Jan 89 17:16:48 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Thu, 12 Jan 89 17:15:46 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0586; Thu, 12 Jan 89 17:15:45 EDT Received: by BITNIC (Mailer X1.25) id 0476; Thu, 12 Jan 89 17:16:55 EST Date: Thu, 12 Jan 89 16:21:32 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Frank da Cruz Subject: ISO8859 vs Kermit To: Frank da Cruz We are looking into the possibility of adding ISO8859 transfer syntax to the Kermit protocol, to allow for transfer of textual data in other than the Roman ASCII alphabet, including the transfer of text in mixed alphabets. Unfortunately, I have yet to see the actual 8859 documents, and I don't really understand how one transmits (or stores) text in mixed alphabets. Is there some kind of meta-character or sequence that introduces an "alphabet shift", followed by a code that designates the alphabet to be used? If so, can anyone describe the actual mechanism, what the alphabet codes are, etc? (Not the alphabets themselves! Just the mechanism for identifying them and switching among them.) Any information, insights, suggestions, caveats, etc, would be most appreciated. 
16-Jan-89 15:45:20-GMT,5629;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA26322; Mon, 16 Jan 89 10:45:16 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Mon, 16 Jan 89 10:44:21 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4417; Mon, 16 Jan 89 10:44:19 EDT Received: by BITNIC (Mailer X1.25) id 2349; Mon, 16 Jan 89 10:45:57 EST Date: Mon, 16 Jan 89 14:00:17 +0100 Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Andre' PIRARD Subject: Re: ISO8859 vs Kermit To: Frank da Cruz In-Reply-To: Message of Thu, 12 Jan 89 16:21:32 EST from >We are looking into the possibility of adding ISO8859 transfer syntax to the >Kermit protocol, to allow for transfer of textual data in other than the Roman >ASCII alphabet, including the transfer of text in mixed alphabets. Nice to meet you here too, Frank, Well, it's true ISO and ANSI define escape mechanisms to switch from one character set to another and in particular between the G0 and G1 sets of a single version of ISO 8859 when transmitted over a 7-bit line. I don't think the intent is to define a means to store the data, what Kermit is involved in transmitting. It would both be very inefficient in terms of storage space and ease of processing and take us back to the previous situation where accented letters were stored in the form of printer-ready symbols overstrikes, exactly what ISO is trying to avoid. While these escape mechanisms can be used to implement a super terminal (this may apply to a Kermit's terminal mode) which would know all ISO 8859 versions and would be driven by a fancy host, this host would be better off storing its data in 16-bits or more elements. Consequently, Kermit would transmit these. I think that the ISO8859 versions are exclusive, but that they must translate the same way between ANSI and EBCDIC. 
IBM switches character sets, but does not mix them. 16 or more bits codes is a final solution, but puts a heavy load upon hardware. The only place I've read anything like it but theory is in the OS/2 technical manual which speaks of DBCS in chapter 6 ("Language DBCS environment vector of lead bytes", how filename elements are not truncated in case DBCS is involved and such faint remarks I'd like to know more about). But DBCS still means "double byte character sets" and does not look like true 16-bit codes. Anyone knows more about that? As to Kermit dealing with ISO 8859, I've done that between IBM PC and CMS, and it may be interesting to explain how. Both the CMS (and it could be TSO) through the 7171 and the IBM PC act as ISO 8859 host and terminal respectively, because I assume every byte that travels on the communication line is (at least supposed to be) coded in ISO. Which version is irrelevant if I'm right in saying all versions translate the same between ANSI/ISO and an IBM mainframe code page. The IBM world is the worst case, because code pages for a single ISO version are multiple on the same machine. The working in a different ISO version would just involve a code page switch in terminal mode and when having DOS process the data. - The 7171 translate tables have been set up to translate the host code page to/from ISO 8859/x. Which code page for the /x version is used (037v2 or 500) is selected by the answer to the terminal type request. - CMS Kermit translate tables have been modified to extend ASCII/EBCDIC translation to ISO/CECP 037v2 to minimize dynamic redefinition. E. G. selecting CECP 500 is now a handful of SETs. Thanks to John Chandler for a versatile file transfer translation support. - The program on the micro translates transferred text files from the line's ISO code to/from a user's selectable one (437, 850 or ISO itself which means no translation). This is super easy to add to any Kermit (just the user interface causes problems). 
- It does the same for terminal mode. Easy too: SI/SO plus a simple translation that some already do.

This is in line with the idea I once developed on the Kermit lists: using ISO as the inter-system vehicle really simplifies the handling and user understanding of the various IBM or other codes (each system deals with its own).

In fact, I've gone a step beyond that. In addition to the one for file transfer, the translations in the program on the micro are made at the keyboard and screen interfaces. This means the program processes the ISO code in memory (though it could be any code) and never translates the line code. The internal encoding of the program's messages is ISO, which makes them independent of the code page the system uses. Two translations are made, one for menu mode and the other for terminal mode. The one for menu mode is the translation between ISO and the system's code page at startup. The one for terminal mode is the user's choice and may also imply a code page that the system is asked to switch to each time the user enters terminal mode. Again, there are no restrictions on which code can be used. New ones can be added to the program by configuration, including null translation throughout, so that it remains compatible with any Kermit implementation.

I hope this will help.

André.
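The table-driven translation described in this message, between the ISO code on the line and a user-selectable local code page, might be sketched as follows. This is only an illustration in Python, not the author's program: `build_table`, `LOCAL_CODES`, `from_line` and `to_line` are hypothetical names, the code pages are taken from Python's standard `latin-1`, `cp437` and `cp850` codecs, and substituting `?` for characters missing from the target code is an assumption.

```python
# Sketch: 256-entry byte translation tables between the line code
# (assumed here to be ISO 8859-1) and a user-selectable local code page.

def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Return a 256-byte table mapping each byte of `src` to its `dst` byte."""
    table = bytearray(256)
    for byte in range(256):
        try:
            char = bytes([byte]).decode(src)
            table[byte] = char.encode(dst)[0]
        except (UnicodeDecodeError, UnicodeEncodeError):
            # Assumption: bytes with no counterpart in `dst` become "?".
            table[byte] = fallback[0]
    return bytes(table)

# Hypothetical user choices; "latin-1" means no translation at all.
LOCAL_CODES = ("cp437", "cp850", "latin-1")

from_line = {code: build_table("latin-1", code) for code in LOCAL_CODES}
to_line = {code: build_table(code, "latin-1") for code in LOCAL_CODES}

received = b"caf\xe9"  # "cafe" with e-acute, as it travels on the line
local = received.translate(from_line["cp437"])
print(local.hex())                                     # bytes for a code page 437 screen
print(local.translate(to_line["cp437"]) == received)   # back to the line code
```

Because each direction is a single 256-byte table, the same `bytes.translate` call serves file transfer, keyboard and screen alike; only the table chosen differs.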
16-Jan-89 20:46:31-GMT,1906;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA08283; Mon, 16 Jan 89 15:46:28 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Mon, 16 Jan 89 15:45:33 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 4742; Mon, 16 Jan 89 15:45:31 EDT Received: by BITNIC (Mailer X1.25) id 5200; Mon, 16 Jan 89 15:47:36 EST Date: Mon, 16 Jan 89 10:49:33 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: ISO8859 vs Kermit To: Frank da Cruz In-Reply-To: Kermit and ISO 8859 For character set switching, see the ISO 2022 standard. The ANSI X3.134.1 standard, 8-bit ASCII Structure and Rules appears to have the information you will need. ANSI X3.134.2 is the U.S. equivalent of ISO 8859-1. However, as I read it, it allows the 8859-1 characters to exist in the 7-bit world. This may be of interest to you. You should also read about IBM Country Extended Code Pages (9 of them) which have the same character set as ISO 8859-1, and PC Multilingual Code Page 850. (See SHARE 69 Proceedings, pp. 19-28, August, 1987.) With respect to ISO 8859-2, and the corresponding IBM Code Page, the translate table for these two is DIFFERENT from the one for 8859-1 to CECPs. I do not know about the other 8859 and IBM code page translate tables. If you have an IBM APA printer and SCRIPT, I can send you code tables for ISO 8859-1, CECP 37 v1. Data Processing Code Page, and CECP 500 v1. Office Systems Code Page. The code tables print correctly except for about 5-10 characters. ISO standards are available from ANSI which is right in New York City. 
Ed Hart 24-Jan-89 12:15:21-GMT,3335;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA29846; Tue, 24 Jan 89 07:15:09 EST Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Tue, 24 Jan 89 07:13:04 EDT Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 5066; Tue, 24 Jan 89 07:13:02 EDT Received: by BITNIC (Mailer X1.25) id 3455; Tue, 24 Jan 89 07:14:59 EST Date: Tue, 24 Jan 89 12:45:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: Parts of ISO 8859 To: Frank da Cruz Dear list subscribers Here is the information that was wanted about ISO 8859. ISO 8859 8-bit single byte coded graphic character sets ISO 8859/1 1987-02-15 Latin alphabet no. 1 ISO 8859/2 1987-02-15 Latin alphabet no. 2 ISO 8859/3 1988-04-15 Latin alphabet no. 3 ISO 8859/4 1988-04-15 Latin alphabet no. 4 DIS 8859/5 (1988-03-15) Latin/Cyrillic alphabet ISO 8859/6 1988-08-15 Latin/Arabic alphabet ISO 8859/7 1987-11-15 Latin/Greek alphabet ISO 8859/8 1988-06-15 Latin/Hebrew alphabet DIS 8859/9 (1989-02-15) Latin alphabet no. 5 ISO 9036 1987-04-15 Arabic 7-bit coded character set for information interchange (date for a DIS means "voting terminates on:".) There is a list of languages covered by each of the 9 parts, under "Field of application". This includes: for Part 1: Spanish, Portuguese, Italian, French, English, Irish, German, Dutch, Danish, Faeroese, Icelandic, Norwegian, Swedish, Finnish. for Part 2: English, German, Czech, Slovak, Hungarian, Polish, Rumanian, Serbocroatian, Slovene, Albanian. for Part 3: Spanish, Italian, French, English, German, Dutch, Afrikaans, Catalan, Maltese, Turkish, Esperanto. for Part 4: English, German, Danish, Greenlandic, Norwegian, Swedish, Finnish, Lappish, Estonian, Latvian, Lithuanian. for Part 5: English, Russian, Byelorussian, Ukrainian, Bulgarian, Serbocroatian, Macedonian. 
for Part 9: (as Part 1, but with Turkish instead of Icelandic)

Annex A gives: "The coded character set of this part of ISO 8859 contains graphic characters used in at least the following countries:". This includes:

for Part 1: all countries of North, South and Middle America, Australia, New Zealand, Spain, Portugal, Italy, France, United Kingdom, Ireland, Switzerland, Liechtenstein, Austria, Germany, Belgium, The Netherlands, Luxemburg, Denmark, Faroe Islands, Iceland, Norway, Sweden, Finland.
for Part 2: Switzerland, Austria, Germany, Czechoslovakia, Hungary, Poland, Romania, Yugoslavia, Albania.

Parts 1, 2, 3, 4 and 9 include MULTIPLY and DIVIDE, always with the same code. Parts 5, 6, 7 and 8 do not.

Correspondence between ISO and ECMA standards

    ISO       ECMA    Registration number of
                      escape sequence (ISO 2375)
    8859/1     94     100
    8859/2     94     101
    8859/3     94     109
    8859/4     94     110
    8859/5    113     111
    8859/6    114     127
    8859/7    118     126
    8859/8    121     138
    8859/9    128     148

FROM J. W. van Wingen MOSGLA@HLERUL2
Mail to P. O. Box 486, 2300AL Leiden, Netherlands

1-Feb-89 17:19:02-GMT,1904;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA29458; Wed, 1 Feb 89 12:18:28 EST
Received: from CUVMB.CC.COLUMBIA.EDU(MAILER) by CUVMB.CC.COLUMBIA.EDU(SMTP) ; Wed, 01 Feb 89 12:13:15 EDT
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6487; Wed, 01 Feb 89 09:33:07 EDT
Received: by BITNIC (Mailer X1.25) id 5236; Wed, 01 Feb 89 10:30:55 EST
Date: Wed, 1 Feb 89 14:53:00 CET
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: ISO 8859-5
To: Frank da Cruz

Dear list subscribers

I just received ECMA Memento 1989. It includes a list of ECMA Standards, with the remark: "Free copies of all documents listed below are available upon request." They are mostly identical to those of ISO.
The address is ECMA Headquarters, Rue du Rhone 114, CC-1204 GENEVA, Switzerland, (Telex 222.88, after 1989-06-14 41.32.37). The document numbers are in my previous mailing. As for Cyrillic (8859-5), the code is NEW (from the USSR). Col.s 11,12 now contain 32 capitals and 14,15 32 small letters in the CORRECT alphabetic order. Col. 10 contains the capitals of Jugocyrillic etc., and col. 15 the small ones. In 10 there is NBSP, E-trema, Dj, G-acc, Ukr. E, Maced. S, I, I-trema, J, Lj, Nj, H-barred, K-acc, SHY, U-short, Dz. In 15 there is "No" at 15/00 and "SS" (paragraph sign) at 15/13. Note that the Jugoslav Nat. Standard is different, conforming to the alphabetic order of the Latin transliteration, (just like the old GOST). DIS 8859-5.2 contains several mistakes in the letter names. FROM J. W. van Wingen MOSGLA@HLERUL2 Mail to P. O. Box 486, 2300AL Leiden, Netherlands 10-Feb-89 14:29:42-GMT,1391;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA21089; Fri, 10 Feb 89 09:29:37 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4820; Fri, 10 Feb 89 09:27:09 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9342; Fri, 10 Feb 89 09:27:08 EST Received: by BITNIC (Mailer X1.25) id 8149; Fri, 10 Feb 89 10:28:02 EST Date: Fri, 10 Feb 89 15:03:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: ISO10646 To: Frank da Cruz Dear list subscribers First Draft Proposal DP 10646, Multiple octet coded character set (SC2 N 1987) arrived last Monday. It is 140 pages. The voting period ends 1989-05-30. It is under the care of ISO/IEC JTC1/SC2/WG2. It will have considerable influence on coding in the next decade. To give you an impression on what it is about, I'll mail a copy of an Informal Introduction on it (3 pages). 
Be warned, it contains box characters, conforming to GT12 (or even TN?). All the letters, however, are orthodox.

FROM J. W. van Wingen MOSGLA@HLERUL2
Mail to P. O. Box 486, 2300AL Leiden, Netherlands

10-Feb-89 15:02:53-GMT,11012;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA23517; Fri, 10 Feb 89 10:02:45 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4834; Fri, 10 Feb 89 10:00:16 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9387; Fri, 10 Feb 89 10:00:15 EST
Received: by BITNIC (Mailer X1.25) id 8442; Fri, 10 Feb 89 10:56:02 EST
Date: Fri, 10 Feb 89 15:27:00 CET
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: Informal Introduction to ISO 10646
To: Frank da Cruz

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION      ISO/IEC JTC1/SC2/WG2
INTERNATIONAL ELECTROTECHNICAL COMMISSION           N 274
Joint Technical Committee 1
Subcommittee 2 Characters and Information Coding, Working Group 2

======================================================================
Introduction to ISO 10646 - Multiple-Octet Coded Character Set
======================================================================

A new standard is being developed within Working Group 2 of ISO/IEC JTC1/SC2 for the multiple-octet coded character set. Formal drafts will be issued during 1989.

Its purpose is to provide a single character code which will permit the written form of all present-day languages throughout the world to be used within computers, to be processed and interchanged. All types of text written in character form will be provided for, from simple commercial documents to publication of technical reports etc. Also the bibliographic requirements of librarians will be met.
The structure of the whole code may be illustrated thus, with an octet of bits for each dimension:

[Figure: a stack of planes, each a square array addressed by Row (vertical axis) and Cell (horizontal axis). One plane, captioned "Basic multi-lingual plane", shows four segments containing the labels A00 and J1, A01, A10 and C1, A11 and K1. The other planes, captioned "Supplementary planes", carry the labels (future standardization), (Korean), (Japanese), (Chinese) and (bibliographic).]

The basic multi-lingual plane will contain four segments for graphic characters, each holding 96 * 96 characters.

Each segment will be divided into two zones: an alphabetic zone of 16 * 96 characters, and another zone either for the most-frequently used characters of the Chinese, Japanese and Korean ideographic scripts, or for certain special purposes.

The shaded area outside the graphic quadrants will be used for control functions. All those of ISO 6429, ISO 6937 and ISO 8613 will be available, with the same coding.

The supplementary planes will accommodate characters that overflow from the basic multi-lingual plane.

A coded character anywhere in the code may be uniquely identified by means of three octets:

    m-s +-------------+-----------+------------+ l-s
        | Plane-octet | Row-octet | Cell-octet |
        +-------------+-----------+------------+

NOTE: Sequences of characters run horizontally along the rows, not vertically as in previous code tables.
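The three-octet identification above can be illustrated with a small sketch. It is written in Python purely for illustration and reflects only what this 1988 introduction says (plane-octet most significant, cell-octet least significant); `CodedChar`, `pack`, `unpack` and `linear_index` are hypothetical names, and the octet values used are arbitrary, not taken from the draft.

```python
from typing import NamedTuple

class CodedChar(NamedTuple):
    """One coded character: most significant octet first."""
    plane: int
    row: int
    cell: int

def pack(char: CodedChar) -> bytes:
    """Serialize as plane-octet, row-octet, cell-octet."""
    for octet in char:
        if not 0 <= octet <= 255:
            raise ValueError(f"octet out of range: {octet}")
    return bytes(char)

def unpack(octets: bytes) -> CodedChar:
    """Inverse of pack()."""
    if len(octets) != 3:
        raise ValueError("expected exactly three octets")
    return CodedChar(*octets)

def linear_index(char: CodedChar) -> int:
    """Position in a single sequence where cells vary fastest,
    matching the note that characters run along the rows."""
    return (char.plane << 16) | (char.row << 8) | char.cell

# Arbitrary illustrative values, not assignments from the draft.
example = CodedChar(plane=32, row=40, cell=65)
print(pack(example).hex(), unpack(pack(example)) == example)
```

With the cell-octet least significant, two characters in neighbouring cells of one row differ by exactly one in `linear_index`, which is the sense in which sequences run horizontally along the rows.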
The code may be used in different forms-of-use:

a) A four-octet form, in which the three octets for the character are preceded by one for systems use. Three octet coding will never be used.
b) A two-octet form, restricted exclusively to a single plane. Especially for users with alphabetic scripts, this will accommodate probably 99% of their applications.
c) A two-octet form with extension using occasional four-octets.
d) A compacted form, permitting strings of related characters to be used as single-octets.

The basic multi-lingual plane is being designed to permit easy inter-working with existing 8-bit codes. Generally, conversion will be by the table look-up technique; however, conversion with ISO 8859 parts 1,2,5,6,7,8 may use a simple algorithm.

All designation, invocation and shifting as in ISO 2022 will be avoided. It is considered that the consequent simplification of software, especially for generalized applications in the OSI environment, will make this code economically attractive despite the relatively extravagant use of bits.

The layout of the basic multi-lingual plane may be illustrated in FIGURE 1, the axes being not drawn linearly.

NOTE: The value of any octet is shown in simple decimal notation, e.g. 032, 255.

The contents of any of the rows are set out in detailed code tables. These are drawn on a pro-forma which shows a complete row in twelve strips, each of 16 graphic characters.

Because the code is designed to be used as a whole, especially the basic multi-lingual plane, no significance attaches to whether certain characters are in the left hand or right-hand halves of a row, or early or late in the code table. A character once included in the code table is not duplicated elsewhere. Therefore for any particular application characters will be taken from many different places in the code table.
For example users within Greece will find Greek letters in row 040, the equivalent Latin letters they use for transliteration in row 032, and some symbols they use in row 034.

It will be trivially easy to adapt any equipment designed for the Japanese or Chinese scripts to provide all the characters of the basic multi-lingual plane. Therefore it is expected that suitable cost-effective equipment will become readily available.

The feature of fixed length coding, especially in the two-octet mode-of-use, will make this code very easy to use in high-level programming languages and other software as employed for OSI and ODA.

Hugh McG Ross, editor. Revised Oct. 1988

FIGURE 1   ISO 10646   Structure of the basic multi-lingual plane

(Cell-octets run across the figure in two halves, 032-126 and 160-255. The row-octet shown is the label printed where each band begins; the axes are not drawn linearly.)

    Row-octet   Contents
    000         (outside the graphic segments)
    032         Latin script for European languages
    033         ISO 8859-1 and -2 and ISO 6937-2
    034         Extended symbols from ISO 8879
    035         Extended Latin script for all world languages
    037         Special African and phonetic letters
    038         Cyrillic script for major languages;
                Cyrillic for all minority languages
    040         Greek script for all forms of writing
    042         Arabic script for all languages
    043         Hebrew script
    044         Other scripts
    048-126     left half: Japanese ideographs, JIS X 0208
                right half: Special Purpose
    160         Indian scripts; Mathematical symbols; Oriental scripts
    176-255     left half: Chinese ideographs, GB 2312
                right half: Korean ideographs, KS C 5601

    The bands beginning at rows 032 and 160 are marked "Alphabetic scripts"
    and "Alphabetic" in the margin of the figure.

15-Feb-89 13:29:51-GMT,1151;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA01608; Wed, 15 Feb 89 08:29:47 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 6673; Wed, 15 Feb 89 08:29:42 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6388; Wed, 15 Feb 89 08:29:41 EST
Received: by BITNIC (Mailer X1.25) id 0664; Wed, 15 Feb 89 08:30:46 EST
Date: Wed, 15 Feb 89 08:25:54 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: Re: Requirements Feedback/Agreements and Disagreements
To: Frank da Cruz
In-Reply-To: Translations, AECS Requirement 6

Inherent in ISO 8859-1 and the Country Extended Code Pages (EBCDIC) is a one-to-one mapping for the characters. We require that the one-to-one relation be extended to control characters. This will allow "round-trip" integrity for all data. See AECS Requirement 6.
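The round-trip property asked for here can be checked mechanically, for control characters as well as graphic ones. The sketch below is in Python and rests on assumptions: that Python's standard `cp037` and `cp500` codecs are acceptable stand-ins for Code Pages 37 and 500, and that `latin-1` stands for ISO 8859-1 together with the usual C0/C1 control positions. `is_one_to_one` and `round_trip` are hypothetical helper names.

```python
def is_one_to_one(ebcdic_codec: str) -> bool:
    """True if the 256 EBCDIC bytes map onto 256 distinct ISO 8859-1 bytes."""
    all_bytes = bytes(range(256))
    iso = all_bytes.decode(ebcdic_codec).encode("latin-1")
    return len(set(iso)) == 256

def round_trip(data: bytes, ebcdic_codec: str) -> bytes:
    """Translate EBCDIC -> ISO 8859-1 -> EBCDIC."""
    iso = data.decode(ebcdic_codec).encode("latin-1")
    return iso.decode("latin-1").encode(ebcdic_codec)

sample = bytes(range(256))  # every byte value, controls included
for codec in ("cp037", "cp500"):
    print(codec, is_one_to_one(codec), round_trip(sample, codec) == sample)
```

A translation that is one-to-one over all 256 values cannot lose data in either direction, which is the integrity the requirement is after.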
Ed Hart 15-Feb-89 17:42:38-GMT,1684;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA22518; Wed, 15 Feb 89 12:42:27 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 6801; Wed, 15 Feb 89 12:42:34 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6915; Wed, 15 Feb 89 12:42:33 EST Received: by BITNIC (Mailer X1.25) id 8074; Wed, 15 Feb 89 12:06:41 EST Date: Wed, 15 Feb 89 10:37:02 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: Requirements Feedback/Agreements and Disagreements To: Frank da Cruz In-Reply-To: Your message of Wed, 15 Feb 89 08:52:52 EST >I would agree with rick's statement above that such translation be one-to-one >reversible. I would add another primary requirement: the code for any >printable ASCII character be translated to a EBCDIC code that represents >the same printable character. These two requirements will mean that some >printable EBCDIC characters are lost, but that is life! The IBM Country Extended Code Pages (CECPs) and ISO 8859-1 share the same character set. In other words, if a character is in a CECP, it is in 8859-1 and vice versa. Thus, for graphic characters (those which display), a one-to-one mapping exists. The pieces are already in place for your requirement IF you move to the 8-bit ASCII world of ISO 8859-1 (which uses ANSI X3.4-1986 (U.S. ASCII) as the left half of the code table). 
Ed Hart 15-Feb-89 18:13:19-GMT,3094;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA24860; Wed, 15 Feb 89 13:12:57 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 6815; Wed, 15 Feb 89 13:13:04 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6991; Wed, 15 Feb 89 13:13:03 EST Received: by BITNIC (Mailer X1.25) id 8118; Wed, 15 Feb 89 12:07:02 EST Date: Wed, 15 Feb 89 16:22:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: SHARE Requirements 2 To: Frank da Cruz Dear list subscribers Here is my next installment. Some little writing errors first. P. 1 prividing --> providing P. 1 Polyas --> Pllya (P/olya) P. 2 coexistance --> coexistence P. 4 Stardards --> Standards Now the requirements: R8: This contradicts what is said in R1: " ........ End users should be concerned with using applications, not how " the character data is encoded. IBM must hide the way character data is " encoded. How the character data is coded must be invisible to end users " and applications developers. However, ................................... R22: Such a thing may be included. It shall, however, only express an INTENTION, not act as a barrier to interpreting data differently. Of course, this facility cannot be meant only for DURING THE MIGRATION. The last paragraph is far too optimistic in regarding the issues it reflects. R23: This is too vague to me. It should say that there should be as few borders as possible, acting as code barriers. IBM should state clearly that national CECP are only a short-term approach, and that a unique EBCDIC is what is aimed at, a compromise between CP037 and CP500. If and when that is said, we can start discussing with IBM what should be in it. 
With ISO 8859 we have only the East-West and perhaps the North-South code barrier, and if we succeed with the 254 char. set, we have even the Iron Curtain eliminated. A good question: how sacrosanct are cols 0-3 of EBCDIC? We may need them for the next conversion scheme. R25: ISO 646 is quite dead now, and will only be kept for the CCITT Telematic Services. R27: At present a printer will prints blanks for unprintables, which I prefer over the proposed options. R28: IBM will say: You are knocking at the wrong door. Nothing prevents you at going to ANSI or their counterparts with these ideas. A thing I missed is a position towards multi-byte sets. Do not overlook that IBM included support for it in TSO/ISPF and produced the 5550 for the Japanese market. Are we willing to code our Latin letters with two bytes instead of one, just for mixing more alphabets and scripts in one document in the future? Xerox has it, but that will not become the ISO Standard. FROM J. W. van Wingen MOSGLA@HLERUL2 Mail to P. O. Box 486, 2300AL Leiden, Netherlands 16-Feb-89 11:55:05-GMT,3191;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA05038; Thu, 16 Feb 89 06:54:58 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 7166; Thu, 16 Feb 89 06:51:56 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8203; Thu, 16 Feb 89 06:51:55 EST Received: by BITNIC (Mailer X1.25) id 4088; Thu, 16 Feb 89 06:52:19 EST Date: Thu, 16 Feb 89 12:48:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: SHARE req. 3 To: Frank da Cruz Dear list subscribers As a further installment I would like to discuss the use of the term "character set". ASCII is often called thus, but in fact the code is meant. There are two concepts, that of the set of characters, and that of the way these are represented by bytes. 
The ISO term for the first is "repertoire" (strictly speaking it is used only in ISO 6937, not in ISO 8859). We may introduce that term into the EBCDIC world too. Thus ISO 8859-1, CP037 and CP500 share the same repertoire, but have different coding, as do the several CECP's for Western Europe. CP850 contains this repertoire as a subset, with again different coding. The ASCII repertoire (7-bit) is a subset of all those in the 9 parts of ISO 8859, always with the same coding. The repertoire of ISO 8859-2 is identical to that of CP870 (as far as is known to me; can anybody tell me in which IBM manual it is defined?), but not with the same coding.

I hope this will be helpful.

Just as a bonus I offer the following text in German (from Goethe's Faust), which, I hope, I have correctly coded in CP037 (I am not going to provide a translation). It may serve as a motto to our effort, for it is an early description of a conversion algorithm, with appropriate comments by the Devil.

Die Hexe (mit großer Emphase fängt an, aus dem Buche zu deklamieren).
    Du mußt verstehn!
    Aus Eins mach Zehn,
    Und Zwei laß gehn,
    Und Drei mach gleich,
    So bist du reich.
    Verlier die Vier!
    Aus Fünf und Sechs,
    So sagt die Hex,
    Mach Sieben und Acht,
    So ist's vollbracht:
    Und Neun ist Eins,
    Und Zehn ist keins.
    Das ist das Hexen-Einmaleins!
Faust.
    Mich dünkt, die Alte spricht im Fieber.
Mephistopheles.
    Das ist noch lange nicht vorüber,
    Ich kenn es wohl, so klingt das ganze Buch;
    Ich habe manche Zeit damit verloren,
    Denn ein vollkommner Widerspruch
    Bleibt gleich geheimnisvoll für Kluge wie für Toren.
    Mein Freund, die Kunst ist alt und neu.
    Es war die Art zu allen Zeiten,
    Durch Drei und Eins, und Eins und Drei
    Irrtum statt Wahrheit zu verbreiten.
    So schwätzt und lehrt man ungestört;
    Wer will sich mit den Narrn befassen?
    Gewöhnlich glaubt der Mensch, wenn er nur Worte hört,
    Es müsse sich dabei doch auch was denken lassen.

Goethe, Faust Teil I, 2540-2566

FROM J. W. van Wingen MOSGLA@HLERUL2
Mail to P. O.
Box 486, 2300AL Leiden, Netherlands

16-Feb-89 17:40:39-GMT,8033;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA27630; Thu, 16 Feb 89 12:40:31 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 7322; Thu, 16 Feb 89 12:37:39 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9192; Thu, 16 Feb 89 12:37:38 EST
Received: by BITNIC (Mailer X1.25) id 4482; Thu, 16 Feb 89 12:09:57 EST
Date: Thu, 16 Feb 89 08:38:37 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: Re: SHARE req. 3
To: Frank da Cruz
In-Reply-To: Code Page 500 versus 37: Compromise Needed?

(This note started as a reply to Johan van Wingen's note with the Faust quote in German. But once I started writing it, I was writing about an area which concerns me. I believe that IBM will resolve the code page 37 versus 500 issue by supporting both of them for the long-term. To me, the political situation dictates that kind of solution from IBM. I would be interested in your thoughts.)

Although I cannot read German, I know the Faust text came through correctly. The reason is that the code points (ISO code positions (hex values)) for code page 37 and code page 500 for the alphabet, numbers, and most other characters are exactly the same. They only differ for 7 characters and code points:

    Code Point   37 V1                      500 V1
    4A           US cent ¢                  Left Square Bracket [
    4F           Vertical Bar |             Exclamation Point !
    5A           Exclamation Point !        Right Square Bracket ]
    5F           Logical Not ¬              Circumflex ^
    B0           Circumflex ^               US cent ¢
    BA           Left Square Bracket [      Logical Not ¬
    BB           Right Square Bracket ]     Vertical Bar |

(The 37 V1 column uses CP 37 characters and the 500 V1 column uses CP 500 characters (I hope!). Ed Hart)

I am concerned IBM will not standardize on one EBCDIC code page. In Europe, CP 500 seems to be firmly entrenched.
In the US, Canada, and Portugal, CP 37 is entrenched. With this situation, I believe IBM will respond by narrowing from 9 CECPs to two CECPs: CP 37 v1 and CP 500 v1. They will do this to maintain compatibility with data customers are already using and to avoid offending customers who have recently converted data to CP 37 or CP 500. Then IBM will build systems to automatically do the translations between CP 37 and 500 for mail, etc.

An alternative to standardizing on both CP 37 and CP 500 is for ALL OF US to find a compromise code page somewhere between CP 37 and CP 500. The compromise must be something we can accept--because it cannot be perfect. Before suggesting anything, I want to raise the following issues:

1. Mainframe and Midrange Programming Languages depend on the US code page(s). Since the US Standard EBCDIC does not define code points for brackets, many products use the TN/T11 print train standard code points for brackets: X'AD' and X'BD'.

2. EBCDIC Code Point X'5F' should be reserved for the NOT function because it is ingrained in the IBM products. However, the ISO 8859 family of codes does not have the NOT character (X'5F' in cp 37, X'BA' in cp 500) in any code but ISO 8859-1. Consequently, the NOT character should not be allowed in programming language syntax. However, in EBCDIC, the compilers use code point X'5F' for NOT. For ASCII terminals, it is fairly common to map the ASCII circumflex (X'B0' in cp 37, X'5F' in cp 500) into the EBCDIC NOT (X'5F' in cp 37, X'BA' in cp 500). (This may be the result of the ASCII-1968 standard which allowed the ASCII X'5E' code point to have "stylized graphics".) If use of the NOT character is an issue to IBM, they should change the compilers to accept the code points for either the NOT or circumflex characters.

3. EBCDIC Code Point X'4F' should be reserved for the vertical bar character (X'4F' in cp 37, X'BB' in cp 500) because it is ingrained in the IBM products. This is another code point and character used in programming languages, for the OR function.
4. Brackets in CP 37 or CP 500 do not match the code points generally used, X'AD' and X'BD'. The Code Page 37 assignments for brackets are not widely used. Code points for brackets affect the PASCAL and C programming languages. Therefore, regardless of the code selected (CP 37 or 500), both PASCAL and C compilers must be changed for new code points for brackets.

5. The C language uses the exclamation point character (X'5A' in cp 37, X'4F' in cp 500). However, because of issue number 4, the C compiler must be changed for brackets. If C must be changed for brackets, changing C for a new code point for the exclamation point is not unreasonable.

6. To my knowledge, mainframes do not place any syntactic significance on the US EBCDIC code points X'4A' (US cent in cp 37, which cp 500 places at X'B0') or X'B0' (circumflex in cp 37, which cp 500 places at X'5F'). Therefore, character assignments to these code points are not as critical as the others mentioned earlier.

Based on these issues, I would recommend a compromise code point assignment. This recommendation uses code point assignments for characters from both CP 37 and CP 500.

The first two code points are the most critical assignments.

    X'5F' to circumflex (issue 2) (cp 500)
    X'4F' to vertical bar (issue 3) (cp 37)

This assignment for brackets is a recommendation.

    X'4A' and X'5A' to left and right brackets (issue 4) (cp 500)

The reasons for this choice are:

1. The code points X'AD' and X'BD' are unavailable in CP 37 and CP 500, and I believe we should focus on fixing the differences between the two code pages rather than creating more differences.
2. The Code Page 37 code points for brackets are not widely used.
3. The X'4A' and X'5A' code points are in wide use in Europe in Country-specific EBCDIC code pages.
4. The code points are next to each other in the code table.

Code points for the remaining characters may be defined by IBM. I believe that the assignments are not critical and therefore, we would waste time discussing assignments. If I am wrong, tell me.
   US cent (cp 37 /cp 500 5)
   Exclamation point (cp 37 !/cp 500 |)
   NOT (cp 37 ^/cp 500 5)

What are your thoughts?

1. Should we continue to request one EBCDIC code page selected from cp 37 or cp 500?

2. Should we request cp 37 v2 with brackets at code points X'AD' and X'BD'?

3. Should we pursue a technical compromise similar to this one to solve what I perceive as very serious political problems? This assumes that one EBCDIC code page for ISO Latin alphabet number 1 is so critical that installations will be willing to convert to it, and that those installations which have already converted to CP 37 or CP 500 would be willing to change again. (They might be more willing if the character and code point changes had minimum effect on them; that is, if they do not use the characters affected.)

4. Should we be prepared to accept the idea that the political situation will dictate a technical solution of two code pages: 37 and 500?

Before you answer, please consider what kind of changes your installation will REALLY be willing to make to obtain one EBCDIC code for ISO Latin alphabet number one. Are US and Canadian installations really willing to convert their data, documents, and source programs to one EBCDIC code page if IBM selects code page 500 as the long-term solution? Are installations in Europe that have recently converted to code page 500 willing to make another conversion to code page 37 or to some compromise code page between code page 37 and 500?

Thank you for all of your comments to date.
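For readers who want to experiment with the compromise assignment proposed above, here is a minimal sketch. It starts from the `cp037` codec bundled with Python, which is assumed here to be a fair stand-in for CP 37 v1, moves circumflex and the brackets to the proposed code points, and leaves the three vacated positions unassigned, because the proposal leaves cent, exclamation point, and NOT for IBM to place. The name `COMPROMISE` and the use of `None` for an unassigned position are illustrative assumptions, not part of any IBM code page.

```python
# Hypothetical compromise table: byte value -> character (None = unassigned).
# Starting point is Python's cp037 codec, assumed to approximate CP 37 v1.
COMPROMISE = [bytes([b]).decode("cp037") for b in range(256)]

# Critical assignments from the proposal.
COMPROMISE[0x5F] = "^"   # circumflex, as in CP 500 (issue 2)
COMPROMISE[0x4F] = "|"   # vertical bar, as in CP 37 (issue 3)

# Recommended bracket assignments, as in CP 500 (issue 4).
COMPROMISE[0x4A] = "["
COMPROMISE[0x5A] = "]"

# cp037 already carried '^', '[' and ']' elsewhere; free those positions so
# that every character keeps exactly one code point in this sketch.
for vacated in (0xB0, 0xBA, 0xBB):
    COMPROMISE[vacated] = None

assigned = [ch for ch in COMPROMISE if ch is not None]

# Characters of the cp037 repertoire that no longer have a code point.
unplaced = sorted(set(bytes(range(256)).decode("cp037")) - set(assigned))
print(unplaced)
```

In this sketch `unplaced` holds exactly the characters whose placement the proposal leaves open, and the three `None` entries are the positions available for them.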
Sincerely,
Ed Hart

16-Feb-89 22:59:43-GMT,2459;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA23485; Thu, 16 Feb 89 17:59:38 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 7471; Thu, 16 Feb 89 17:56:51 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9809; Thu, 16 Feb 89 17:56:50 EST
Received: by BITNIC (Mailer X1.25) id 5958; Thu, 16 Feb 89 17:57:49 EST
Date: Thu, 16 Feb 89 14:25:40 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: John C Klensin
Subject: RE: Re: Requirements Feedback/Agreements and Disagreements
X-To: ASCII/EBCDIC character set related issues
To: Frank da Cruz

Actually, the vertical bar / exclamation-point swapping is a result of the "national character use" positions of ISO 646 and early efforts to confine certain things, like Standards for programming languages that were initially defined in terms of EBCDIC, to the basic version positions. The controversy also included some strange discussions about whether ! (exclamation-point) "looked more" like EBCDIC "solid vertical bar" than "|" (ASCII broken vertical bar) did.

It has been well over a decade since the predecessor of today's ISO character set committees started sending little notes to programming language standards committees encouraging them (us) to clean up their acts and use *only* the basic character set of ISO646. Since the basic character set does not contain | (broken vertical bar at 7/12) and does not contain ^ (caret or circumflex at 5/14) or ~ (tilde at 7/14) either, the "obvious" solution was to map EBCDIC vertical bar into ISO646 2/1 (exclamation mark) and to do something creative with EBCDIC not-sign, like writing <> rather than ^= or ~=.
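The column/row notation used above (7/12, 5/14, 2/1) converts to a byte value as column times 16 plus row. The sketch below does that conversion and flags source characters that fall in the ISO 646 national-use positions. The set `NATIONAL_USE` is written from memory of ISO 646 and should be checked against the standard before anyone relies on it; the function names are illustrative only.

```python
def position(column, row):
    """Convert ISO 646 column/row notation (e.g. 7/12) to a character."""
    return chr(column * 16 + row)

# Assumed list of ISO 646 positions open to national variation
# (to be verified against the standard itself).
NATIONAL_USE = {0x23, 0x24, 0x40, 0x5B, 0x5C, 0x5D, 0x5E, 0x60,
                0x7B, 0x7C, 0x7D, 0x7E}

def outside_basic_set(source):
    """Return the distinct characters of `source` in national-use positions."""
    return sorted({ch for ch in source if ord(ch) in NATIONAL_USE})

print(position(7, 12), position(5, 14), position(7, 14), position(2, 1))
print(outside_basic_set("IF A ^= B | C"))   # uses two national-use positions
print(outside_basic_set("IF A <> B ! C"))   # stays inside the basic set
```

Under these assumptions the second spelling, with `<>` and `!`, is the kind of rewriting the committees were encouraging.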
And, of course, since the character set folks were willing to tell the language folks what *not* to do, but not what to do instead, there was no "standard" about the 'solutions'. Sometimes good intentions go a little astray. John Klensin, MIT Klensin@INFOODS.MIT.EDU To identify the perspective from which the above is written: Chair, ANSI X3J1 (PL/I); Project Editor for PL/I, ISO/IEC JTC1/SC22 17-Feb-89 16:57:51-GMT,7823;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA12654; Fri, 17 Feb 89 11:57:32 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 7759; Fri, 17 Feb 89 11:54:44 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1152; Fri, 17 Feb 89 11:54:43 EST Received: by BITNIC (Mailer X1.25) id 7044; Fri, 17 Feb 89 11:50:32 EST Date: Fri, 17 Feb 89 09:25:15 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Re: SHARE req. 3 X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz I find myself strongly in agreement with Ed's main point here, that, absent both a strong recommendation from the user community AND a clear willingness to bear pain, IBM will "compromise" on two code pages. That would be an improvement, but... A few small observations and quibbles: >2. EBCDIC Code Point X'5F' should be reserved for the NOT function > because it is ingrained in the IBM products. However, > the ISO 8859 family of codes does not have the NOT character (cp 37 ^/ > cp 500 ) in any code but ISO 8859-1. Consequently, the NOT character > should not be > allowed in programming language syntax. However, in EBCDIC, the compilers > use code point X'5F' for NOT. For ASCII terminals, it is fairly common to > map the ASCII circumflex (cp 37 5/cp 500 ^) into the EBCDIC NOT (cp 37 ^/ > cp 500 5). 
(This may be the result of the ASCII-1968 standard which
> allowed the ASCII X'5E' code point to have "stylized graphics".
> If use of the NOT character (cp 37 ^/cp 500 5) is an issue to IBM, they
> should change the compilers to accept the code points for either the NOT
> or circumflex characters.

Several implementations of ISO-standard compilers permit either ASCII caret/circumflex or ASCII tilde as the appropriate stylization of what started out as an EBCDIC 'not'. With the introduction of 'not' in ISO8859-1, I expect that some vendors will decide to accept that too. Or, worse, instead.

As I indicated in my note yesterday, parts of this conversion mess started out in the other direction. Unlike the custom in some of the communications and OSI Standards (the CCITT PAD standards are excellent examples), the ISO programming language Standards do not, in general, specify the codings of the character sets to be used, even in ASCII; their language is a more or less specific version of "use characters that look like this". That has resulted in some tough intra-ASCII conversion problems which some vendors, responding to perceived user needs, have resolved by mapping more than one ASCII character onto a given language character. All of this confuses the 'unambiguous translation between ASCII and EBCDIC' problem considerably, since we can't unambiguously translate between ASCII and ASCII when the semantics assigned by a programming language to a character are considered. The three important examples that I know of are:

   EBCDIC NOT maps to ASCII caret/circumflex and/or ASCII tilde
   EBCDIC vertical bar maps to ASCII exclamation-mark and/or ASCII broken vertical bar (yesterday's discussion)
   EBCDIC (single-)quote maps to ASCII quote (i.e., double quote) and/or ASCII (acute) accent.
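To make the many-to-one problem concrete, here is a minimal sketch of a "permissive" convention in which two ASCII characters are accepted for one language character. The table contents, the `<NOT>` and `<OR>` placeholders, and the function names are all hypothetical; no particular vendor's tables are being reproduced.

```python
# Hypothetical "permissive" convention: more than one ASCII character is
# accepted for a single language character (contents are illustrative).
ASCII_TO_LANGUAGE = {"^": "<NOT>", "~": "<NOT>", "!": "<OR>", "|": "<OR>"}

# On the way back only one ASCII spelling can be chosen for each symbol.
LANGUAGE_TO_ASCII = {"<NOT>": "^", "<OR>": "|"}

def to_language(text):
    """Translate character by character, with no knowledge of context."""
    return [ASCII_TO_LANGUAGE.get(ch, ch) for ch in text]

def to_ascii(symbols):
    return "".join(LANGUAGE_TO_ASCII.get(s, s) for s in symbols)

def round_trip(text):
    return to_ascii(to_language(text))

print(round_trip("IF A ~= B"))         # the tilde comes back as a circumflex
print(round_trip("PUT LIST('Hi!')"))   # the '!' inside the literal changes too
```

Because the table is applied to every character alike, the round trip is not the identity, and text inside a literal is altered along with the operators.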
Since, for all of the vendors who chose one of each of these code/graphics pairs and some of those who chose them as alternatives, the "other" character is permitted in strings, translation between one set of conventions or the other--and hence back to EBCDIC code pages-- that are semantics-preserving have to be done by a parsing process, sometimes with a few heuristics, rather than by character by character translation in a data stream. It makes it hard to make firm statements about what programming languages "do" or "should do". >6. To my knowledge, mainframes do not place any syntactic significance to the > US EBCDIC code points X'4A' (cp 37 /cp 500 5) or X'B0' > (cp 37 5/cp 500 ^). Probably nothing Standardized. X'4A' is the character that is often used as a stylization of ASCII back-slant/reverse-solidus in some software, especially terminal emulators and is used as a separator in some widely-circulated applications packages (precisely because it is not used by anything else). I know of nothing that uses X'B0', or anything else in column B in a critical way as a character with semantic significance, but that may be just my lack of knowledge. >Based on these issues, I would recommend a compromise code point assignment. >This recommendation uses code point assignments for characters from both CP 37 >and CP 500. > >The first two code points are the most critical assignments. > X'5F' to circumflex (issue 2) (cp 500) > X'4F' to vertical bar (issue 3) (cp 37) > >These assignment for brackets is a recommendation. > X'4A' and '5A' to left and right brackets (issue 4) (cp 500) This seems technically reasonable and politically attractive. >Code points for the remaining characters may be defined by IBM. I believe >that the assignments are not critical and therefore, we would waste time >discussing assignments. If I am wrong, tell me. 
I agree, but the recommendation must stress, perhaps even more strongly than the present text, that "defined by IBM" means "defined once, in one place", not "IBM may define a series of alternatives".

> Exclamation point (cp 37 !/cp 500 |)
> NOT (cp 37 ^/cp 500 5)

Also see comments on these characters above.

From what I've seen of IBM's decision-making in other areas, they tend to prefer leaving those who are already unhappy in that state, rather than making, or even risking making, those who are happy less happy. Consequently, pushing even toward two (only) code pages is going to be a tough one. The case will, I think, be considerably strengthened if the people who want something are in a position to say "this change is going to hurt us a lot too, but it is important if there is going to be a future in which things are not worse".

Part of the argument that should be made, and which I don't think Ed's draft makes clearly enough, is that, if we can get

(a) Unambiguous and reversible mappings between ISO8859-n and EBCDIC CPm, with IBM agreement to specify the "official" 'n,m' pairs in a public way and to increase 'm' as needed as 'n' increases. There really is no alternative to this, unfortunately: if 'n' has to rise above 1 because of character set content (not just code point mapping), then the number of code pages will have to rise above 1.

(b) A single, standard, compromise EBCDIC code page to be used in IBM operating systems and products, especially programming languages and data communications, such that alternate code pages are used the way alternate ISO8859-n forms are used: locally or by control-sequence introduced departures from the 'standard'. And, as with ISO8859, the "alternate" code pages are built up from a common core that permits those operating systems and products to be completely standard across code pages. Otherwise, you just get the present chaos at a new point.

...then we will be satisfied, if not happy.
And, more important, IBM will be spared a strong case for replacing EBCDIC internally with ISO8859 at some point in the future, since no one (well, nearly no one) should care what they do internally as long as they communicate clearly at the boundaries. More than that is probably unrealistic to hope for. On the other hand, that is quite a lot. John Klensin 18-Feb-89 2:42:59-GMT,2397;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA05609; Fri, 17 Feb 89 21:42:56 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 8032; Fri, 17 Feb 89 21:40:37 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 2119; Fri, 17 Feb 89 21:40:36 EST Received: by BITNIC (Mailer X1.25) id 0444; Fri, 17 Feb 89 21:34:38 EST Date: Fri, 17 Feb 89 14:42:37 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: John C Klensin Subject: RE: Re: SHARE req. 3 X-To: ASCII/EBCDIC character set related issues To: Frank da Cruz > Personally, (you're all going to laugh)... > ... in such a way that EBCDIC and ASCII be transparent to the user. I promise not to laugh if you promise to not make me hold my breath. I'd expect to see a distinct temperature drop in the usual hot place sometime first. Not that it might not be a good idea, but all of us put together are not worth one large bank or insurance company in terms of getting IBM to change its ways, policies, or software. > The day will come when type "char" in C will be >16 bits rather than the current 8. That will be the day that most of the C programs in the world stop working. Keep in mind that this change will seriously alter the semantics of every C program that believes that 'char' == 'int' == one eight bit byte. Lots of stuff, including parts of the language definition, seem to depend on that assumption. 
What you might see instead is the introduction of 'longchar', with the use of 'char' gradually disappearing, but that is not transparent and not something that is likely to happen soon either.

> EBCDIC would play
>only a minor roll and then go the way of card punches.

Clearly the "right" solution. Now let me introduce you to the guy in the next office who has been trying to get me to attach a card reader/punch to my VAX for the last four years so he can process his data archive (which closely resembles a row of 24-drawer grey cabinets in the hall).

John Klensin, MIT (Klensin@INFOODS.MIT.EDU)

22-Feb-89 11:20:40-GMT,9164;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA28035; Wed, 22 Feb 89 06:20:35 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 9571; Wed, 22 Feb 89 06:27:49 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 7285; Wed, 22 Feb 89 06:27:47 EST
Received: by BITNIC (Mailer X1.25) id 5851; Wed, 22 Feb 89 06:28:06 EST
Date: Wed, 22 Feb 89 12:19:00 CET
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Johan van Wingen
Subject: ISO 8859 trouble spots
To: Frank da Cruz

Dear list subscribers

I intend to submit the following document to ISO/JTC1/SC2/WG3 for their next meeting. But before doing that I would like to have your comments on it.

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

THE REMAINING TROUBLE SPOTS IN ISO 8859

0. Introduction

The several parts of ISO 8859 were approved a few years ago and are now being implemented increasingly. A lot of experience has been collected. In general the reaction has been that the standard is excellent, but some weaker points of the standard are now becoming visible. These should be discussed before habits grow entrenched.
Applications in the field of programming languages have been the source of most of the comments.

1. The problem of diacritics

There is a long tradition in the writing and printing industry for extending the available 26 letter Latin alphabet. Extra letters are created by putting a little mark over or under a letter. These are called "diacritical marks": accents, umlauts, cedillas and so on. They are also used in some languages for putting a stress on a syllable. (Barring a letter is not considered applying a diacritic.) Where the number of available characters was severely restricted, as with typewriters, separate diacritics provided a solution with the practice of overprinting. This approach was copied in ISO 646 using BACKSPACE, and with ISO 6937-2 using non-spacing diacritics. ISO 646 provided only a few: underline, acute, grave and circumflex accent, diaeresis (umlaut), overline/tilde. These can also be used free-standing, that is without BACKSPACE, in which form they soon acquired a new meaning: low line, apostrophe, prime, upward arrow head, quotation mark. The comma could also be used for cedilla. This double use (already considerably reduced in ISO 646-1983) was not allowed in ISO 6937-2, where diacritics (a larger set) must occur only in predefined combinations with certain letters, or, exceptionally, with a SPACE. They are always non-spacing. In order to preserve the existing characters from ISO 646, ISO 6937-2 contains both a spacing and a non-spacing circumflex, grave accent and tilde. This introduces a double way of representing three characters. Astonishingly, the standard prefers for these three the single byte representation, the other "is deprecated". In ISO 8859 diacritics occur again. But all characters in it are always spacing without exception. However, diacritics have no meaning in themselves. What is the use of a free-standing cedilla? One can only conclude that their presence is useless and a waste of valuable positions.
Keeping them there can lead to two undesirable developments. First, implementers may violate the rules of ISO 8859 by making the diacritics non-spacing, or second, they may attach to them, when free-standing, a new meaning, as has been done with the circumflex, often used as "control". These characters deserve to be removed at the first opportunity. It will make it possible to include Turkish in ISO 8859-1.

2. The Logical OR and the Logical NOT

A need for characters having the meaning of the Logical OR and the Logical NOT was introduced by PL/I (1964). The first compilers used EBCDIC. Thus the problem for ASCII and ISO 646 became apparent only somewhat later. As there were no positions left, some way of escape had to be found.

ASCII (USAS X3.4-1968) contains in 6.4: "No specific meaning is prescribed for any graphics in the code table except that which is understood by the users. Furthermore, this standard does not specify a type style for the printing or display of the various graphic characters. In specific applications, it may be desirable to employ distinctive styling of individual graphics to facilitate their use for specific purposes as, for example, to stylize the graphics in code positions 2/1 and 5/14 into those frequently associated with Logical OR and Logical NOT, respectively." (These graphics normally represent Exclamation Point and Circumflex.)

In ISO R 646-1967 the text is somewhat different: "4.3 Interpretation of graphics The meaning of the graphics is not defined by this ISO Recommendation. It will be necessary to reach agreement on the meaning and this will depend upon the particular application except in cases where other ISO Recommendations already exist. However no interpretation may be chosen which is contradictory to the customary meaning. A graphic symbol can have more than one meaning, e.g. the graphical symbol - (minus) also can have the meaning of hyphen or separation mark.
The font design of the symbol is not part of this ISO Recommendation."

Mackenzie (2) comments on this: "The last sentence of Section 4.3 leaves the question of "font design" open; that is, a manufacturer could design Exclamation Point to look like Vertical Bar and Circumflex like NOT sign. The LOGICAL OR/Logical NOT problem had finally been solved." Unfortunately this was an illusion, as we shall see.

In ISO 646-1973 we still find in 5.3: "The names chosen to denote graphic characters are intended to reflect their customary meanings. However, this International Standard does not define and does not restrict the meanings of graphic characters. In addition, it does not specify a particular style or font design for the graphic characters."

In ISO 646-1983 we find at the end of 4. : "The names chosen to denote graphic characters are intended to reflect their customary meaning. However, this International Standard does not define and does not restrict the meanings of graphic characters. Neither does it specify a particular style or font design for the graphic characters when imaged."

Graphic characters are distinguished by their name, not by their shape. In ISO 646 the Vertical Line turns up, that can be used for Logical OR, but that name is not included. Equally, Upward Arrow Head, Circumflex (for 5/14) is never additionally named Logical NOT. Thus a sound basis for using both in this way is missing. Nevertheless, widespread use of Vertical Line and Circumflex for OR and NOT could be found, just as * and / are employed for "multiply" and "divide". This development cannot easily be redressed. Thus it was a most unfortunate idea to include a new code for NOT in ISO 8859-1. Confusion was aggravated by not including it in ISO 8859-2. It continues to cause problems in attempts to establish a uniform translate table for EBCDIC - ISO8859.

3.
Obsolete signs

A compiler writer needs to know how a certain character in a program has to be classified, as a digit, as a letter (mostly it does not matter which) or a special character with a given meaning. Checking whether a byte is meant to be a letter would be easier if the letter areas of ISO 8859 had been contiguous. Instead of that, quite obsolete characters for multiply and divide, for which * and / have been used in programs for more than 25 years, have been inserted in the middle of a column. A look-up table is required to decide whether a character is considered a letter or not. Even if this cannot be avoided anyway, the introduction of unnecessary exceptions is always a bad thing, as is the destruction of a stable convention. It is no good if language designers are now going to be pressed for including two graphic symbols, meaning the same thing, into the syntax. Removing "multiply" and "divide" would make place for putting in the French ligature "OE" and "oe" again, which the logic of 6937 wanted to keep and that of 8859 wanted to go.

4. Icelandic versus Turkish

Mixing characters from several parts of ISO 8859 requires invoking the help of ISO 2022, which much hardware does not support. This imposes a considerable cultural barrier between certain groups of nations. If this barrier coincides with one raised by world politics things are as they are. But if there is none, other priorities should dominate. We have now Latin alphabet no. 5 (ISO 8859-9), and it should be discussed whether or not one including Turkish should prevail over one with Icelandic. There are more than 100 times as many Turks as there are Icelanders.
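The point made in section 3 above, that the letter areas of ISO 8859-1 are not contiguous, can be checked with a few lines. This sketch assumes Python's `latin-1` codec as the ISO 8859-1 table and takes `str.isalpha` as the definition of a letter; both are conveniences of the example rather than anything the standard prescribes, and the names are illustrative.

```python
def is_letter(byte):
    """True if the ISO 8859-1 character at this byte value is a letter."""
    return bytes([byte]).decode("latin-1").isalpha()

# Columns 12 to 15 (X'C0' to X'FF') hold the accented letters.
interruptions = [b for b in range(0xC0, 0x100) if not is_letter(b)]

for b in interruptions:
    # Show the position in column/row notation, the byte value, and the sign.
    print(f"{b >> 4}/{b & 0x0F}", hex(b), repr(bytes([b]).decode("latin-1")))
```

A compiler that wanted a simple range test for letters in this area would need an exception for each position the loop reports, which is the look-up-table cost the document describes.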
23-Feb-89 0:47:38-GMT,1729;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA29033; Wed, 22 Feb 89 19:47:35 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0013; Wed, 22 Feb 89 19:45:28 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8428; Wed, 22 Feb 89 19:45:27 EST Received: by BITNIC (Mailer X1.25) id 3817; Wed, 22 Feb 89 19:25:18 EST Date: Wed, 22 Feb 89 14:52:39 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Edwin Hart Subject: Re: ISO 8859 trouble spots To: Frank da Cruz In-Reply-To: Criticism of ISO 8859 I read through your note on ISO 8859 problems. I agree. I would add that the PL/1 not symbol is only in ISO 8859-1 (and maybe -9, I have not seen -9). Also, many sites in North America map the ASCII tilde (7/14) into the EBCDIC Not. Formal logic courses frequently use tilde as the Not operator. The courses also use V for inclusive Or, and a circumflex-like character for logical And. At some of my SHARE presentations, several people said "Do not use the circumflex character to mean logical Not." In my note about a compromise EBCDIC code page for Reference EBCDIC-1, I proposed keeping the Not FUNCTION at EBCDIC code point X'5F' but using the circumflex CHARACTER there because circumflex was a character common to all of the ISO 8859 standards, and one would presume that people using other ISO 8859 parts would want to use PL/1 or other languages which use a Not symbol. 
Ed Hart 23-Feb-89 0:54:58-GMT,2902;000000000001 Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA29402; Wed, 22 Feb 89 19:54:55 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0017; Wed, 22 Feb 89 19:52:48 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 8444; Wed, 22 Feb 89 19:52:47 EST Received: by BITNIC (Mailer X1.25) id 3909; Wed, 22 Feb 89 19:30:21 EST Date: Wed, 22 Feb 89 16:19:27 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: "Nelson H.F. Beebe" Subject: Comment on ISO 8859 multiply and divide X-To: ISO8859%JHUVM.BITNET@CUNYVM.CUNY.EDU To: Frank da Cruz In-Reply-To: Message from "Johan van Wingen " of Wed 22 Feb 89 04:32:00-MST Johan van Wingen in a posting dated Wed, 22 Feb 89 12:19:00 CET remarks: >> ... Instead of that, quite >> obsolete characters for multiply and divide, for which * and / >> are used in programs for more then 25 years, have been >> inserted in the middle of a column. >> ... Removing "multiply" and "divide" would make place for >> putting in the French ligature "OE" and "oe" again, which the >> logic of 6937 wanted to keep and that of 8859 wanted to go. I have not seen a printed representation of these two characters. If, as I presume, they are a centered sans-serif x for multiply, and a minus with a dot above and below for divide, then there is another problem. In the English-speaking world, that symbol is used to mean division, but in Denmark (and possibly elsewhere in Scandinavia), it means subtract! While circumflex may have been used as a logical NOT in PL/1 environments running with ISO character sets, I would like to point out that in the C language, exclamation point is used as a Boolean (logical) NOT, tilde is used as a one's complement (another kind of NOT) and circumflex as an exclusive OR. 
It would surprise me if there is not now substantially more code extant in C than in PL/1. Given that both EBCDIC and the ISO character sets each contain an exclamation point, and each contain a (possibly-split) bar, it is foolish to consider mapping exclamation point into vertical bar. No responsible editor would permit a vertical bar to be used in natural language text to mean exclamation point, and the heavy use of both symbols in the C programming language for completely different purposes (that lead to syntactically correct, but semantically wrong, code, when the two are exchanged, as I have earlier pointed out on this list) requires that a mapping between exclamation point and vertical bar be discouraged, if not outright forbidden.

-------

23-Feb-89 14:48:41-GMT,1850;000000000001
Received: from CUVMB.CC.COLUMBIA.EDU by cunixc.cc.columbia.edu (5.54/5.10) id AA21150; Thu, 23 Feb 89 09:48:33 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.CC.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0179; Thu, 23 Feb 89 09:46:35 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 9016; Thu, 23 Feb 89 09:46:34 EST
Received: by BITNIC (Mailer X1.25) id 6564; Thu, 23 Feb 89 09:34:37 EST
Date: Wed, 22 Feb 89 23:52:20 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: John C Klensin
Subject: RE: Comment on ISO 8859 multiply and divide
X-To: ASCII/EBCDIC character set related issues
To: Frank da Cruz

In those environments that run with EBCDIC, I would strongly suspect that there is more PL/I use than C use. Even today. And more COBOL use than either. There are also approved, cast-in-concrete ISO and ANSI Standards for PL/I and only Draft Proposals for C; your comments could be construed as "C should be changed prior to standardization, because it uses too many characters in violation of the style in which other programming languages, etc., use them".
That is not a proposal or suggestion, serious or otherwise, just a comment about how things work. What is more important is that this type of semi-quantitative reasoning won't solve any problems. What it will do is to encourage the vendor to say "ok, different character sets for different audiences, since the market pressures run against goring the oxen of large customers", which is what we are trying to avoid. John Klensin, MIT 24-Feb-89 12:33:52-GMT,3034;000000000001 Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA04125; Fri, 24 Feb 89 07:33:49 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0657; Fri, 24 Feb 89 07:31:58 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0750; Fri, 24 Feb 89 07:31:57 EST Received: by BITNIC (Mailer X1.25) id 7970; Fri, 24 Feb 89 07:14:40 EST Date: Thu, 23 Feb 89 01:59:06 PST Reply-To: "Joan M. Winters" Sender: ASCII/EBCDIC character set related issues From: "Joan M. Winters" Subject: Summary of Responses on Hex Codes for Curly Braces X-Cc: SAXTON@SLACSLD, JXH@SLACVM, WBJ@SLACVM, BEBO@CERNVM, COTTRELL@SLACVM To: Frank da Cruz Folks - Finally, here's my summary on what hexadecimal codes are actually used around the EBCDIC world to define curly braces (graphic characters {} at my place), primarily from you on the ISO8859 list. To simplify, of the 20 institutions total I heard from: 16 use X'C0' and X'D0' 3 use X'8B' and X'9B' 1 uses X'C0' and X'D0' for terminals, X'8B' and X'9B' for printers "Use" means these are the only, default, or primary codes for braces. Of the 16, 4 mentioned that by default they print both pairs of code points as braces, even though on input they encode braces only as X'C0' and X'D0'. Another provides such "bi-lingual" code sets for printers, but not by default. 
In addition to SLAC, 1 site has old Tektronix-style plotting software that considers braces to be X'8B' and X'9B', in spite of a general EBCDIC use of X'C0' and X'D0'.

No organization mentioned plans to convert their code points for braces. However, 6 noted conversion within recent years to X'C0' and X'D0'; 3 within the last two years. 1 of the X'8B' and X'9B' places said they may change some things to accept both code pairs. Another seems already to have good support for both. The site with the X'8B' and X'9B' printer-only default has a new character set that prints braces for both code pairs.

The places that had converted to X'C0' and X'D0' seemed basically content with the change. 1 site said they'd never convert again; 2 said if the standard required it, they would one more time. 1 organization even made a plea for being able to re-use the X'8B' and X'9B' code points for other characters. Of the places that use X'8B' and X'9B', 1 said they'd most likely convert to a standard if such came to exist.

It's hard to classify some responses. As usual in this area, answers often differ within an organization, depending on the exact circumstances. I'm bringing the mail I got to SHARE, for those of you who'll be there and are interested in the gory details.

I enjoyed reading your notes, in all their variations. Thank you very much for your help!
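The "bi-lingual" printer tables mentioned in the summary can be pictured as an output translation that renders either code pair as braces. The sketch below builds such a table on top of Python's `cp037` codec, which is assumed here to place braces at X'C0' and X'D0'; the table name and helper are hypothetical and do not describe any site's actual printer setup.

```python
# Output-only table: byte value -> printed character, starting from cp037.
PRINT_TABLE = [bytes([b]).decode("cp037") for b in range(256)]

# Accept the second code pair for braces as well when printing.
PRINT_TABLE[0x8B] = "{"
PRINT_TABLE[0x9B] = "}"

def render(data):
    """Return the printed form of a sequence of EBCDIC byte values."""
    return "".join(PRINT_TABLE[b] for b in data)

print(render(bytes([0xC0, 0xD0])), render(bytes([0x8B, 0x9B])))
```

Such a table is one-way: once both pairs print alike, the page no longer shows which pair the data used, so it helps on output but does not settle which pair to encode on input.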
Joan Winters

7-Mar-89 22:12:59-GMT,9208;000000000201
Return-Path: <@cuvmb.cc.columbia.edu:ISO8859@JHUVM.BITNET>
Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA17364; Tue, 7 Mar 89 17:12:54 EST
Message-Id: <8903072212.AA17364@watsun.cc.columbia.edu>
Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA28389; Tue, 7 Mar 89 17:11:14 EST
Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 5041; Tue, 07 Mar 89 17:09:02 EST
Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 6947; Tue, 07 Mar 89 17:09:00 EST
Received: by BITNIC (Mailer X1.25) id 4183; Tue, 07 Mar 89 17:04:18 EST
Date: Tue, 7 Mar 89 15:27:55 EST
Reply-To: ASCII/EBCDIC character set related issues
Sender: ASCII/EBCDIC character set related issues
From: Edwin Hart
Subject: White Paper Executive Summary
To: Frank da Cruz

Enclosed is a redraft of the Executive Summary of the paper. It is exactly two pages long on an IBM 3820 printer. It is 4 pages on a 1403. I would appreciate any comments by Friday (March 10). Thank you.

Ed Hart

Executive Summary

. . . Let us go down, and there confound their language, that they may not understand one another's speech. Genesis 11:7

Unless IBM resolves fundamental character set and code issues, Systems Application Architecture (SAA) will fail to fully meet its consistency goal. Inconsistencies make using IBM equipment unnecessarily difficult. People find it difficult (1) to exploit PS/2s with mainframe and midrange systems, (2) to communicate business information and mail internationally, and (3) to exploit applications and high-level languages. Because of mainframe and communications inconsistencies on a PS/2, end users type certain characters and are confused by the results. Character set and code problems create a human factors trap for end users, the very people SAA is to serve.
The inconsistencies affect not only IBM's European customers but also IBM's U. S. and Canadian, English-speaking customers. In short, the inconsistencies make IBM systems more difficult to use for both naive and experienced end users, and this must change for SAA to succeed. Character Set and Code Problems Since the early 1970s, end users have experienced many problems with ASCII and EBCDIC character sets and codes. The fundamental problem occurs because certain characters change when people move them between IBM systems, MVS/TSO, VM/CMS, OS/400, and the PS/2. This problem consists of four interrelated facets. The ASCII and EBCDIC Character Sets and Codes Are Inconsistent. The ASCII and EBCDIC character sets do not match. Three ASCII and three EBCDIC characters exist in one code but not the other. Moreover, the ASCII standard evolved but many IBM products still reflect the back level, 1968 standard, rather than the 1977 or 1986 version. EBCDIC is not one code but a family of codes. People misunderstand this. In the U. S., end users use several EBCDIC codes (U. S. standard EBCDIC, TN/T11 print train, and various coded fonts for the IBM 3800 printer series, and office systems EBCDIC). End users are confused because the same character will have a different binary value assigned in different EBCDIC codes, and certain binary values will have different character assignments. As a result, users of IBM computers must be aware of the code being used. Translations between ASCII and EBCDIC Are Inconsistent. Depending on the computer and communications system, people obtain different results when certain keys are struck on a PS/2 or ASCII terminal. MVS uses translations different from VM; communication controllers use different translations than protocol converters; ASCII tapes have yet a different translation. End users cannot understand this. In addition, the IBM "standard" ASCII-to-EBCDIC translation makes no sense to English-speaking U. S. 
and Canadian customers, or to anyone else for that matter. For example, to force an end user to type the ASCII "!" to enter an EBCDIC "|", and the ASCII "[" to enter an EBCDIC "!", simply makes no sense| (oops) ! Required Characters Are Absent from ASCII and EBCDIC. Characters required for modern applications and programming languages are missing from ASCII and EBCDIC. High level languages require syntactically-significant characters to have specific binary values. For example, the NOT symbol, "^", must be X'5F'. To compensate, many installations modified the translate tables. High-level languages frequently allow alternate, multiple character sequences for the missing characters. However, end users insisted on typing just one character, especially when the character is on the keyboard. Also, because U. S. standard EBCDIC lacks bracket characters, installations defined EBCDIC-to-EBCDIC translate tables for IBM 3270 terminals to use IBM's high level languages. IBM's Apparent Character Set and Code Strategy Is Inadequate. IBM appears to have embarked on a strategy which will resolve many of the problems. It seems to be based on standardizing on the character set of the ISO 8859-1 standard which contains most of the characters required for Western European languages. For EBCDIC, IBM created nine Country Extended Code Pages by expanding the language-dependent EBCDIC codes to contain the full character set. For the PS/2, IBM created its PC Multilingual Code. With these changes, the Western European character set is available on all SAA computers. Although we are beginning to see some benefits, this strategy is inadequate. It was designed so customers could avoid a data conversion. However, IBM has never announced any strategy. As a result, installations in Europe and North America are diverging by focusing on two different Country Extended Code Pages for the long-term. 
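[Editorial illustration, not part of the white paper.] The "!" versus "|" confusion described above can be reproduced today with the two EBCDIC code pages named earlier in this archive, Code Page 37 and Code Page 500. The sketch below is a minimal one and assumes that Python's built-in `cp037` and `cp500` codecs are acceptable stand-ins for those code pages; they are not the installation translate tables the paper refers to, and the helper name `reinterpret` is hypothetical.

```python
# Sketch: the same EBCDIC byte means different characters in different
# members of the EBCDIC family. Python's cp037/cp500 codecs are used as
# stand-ins for Code Page 37 and Code Page 500 (an assumption).

def reinterpret(text, written_as, read_as):
    """Encode text under one code page and decode the bytes under another."""
    return text.encode(written_as).decode(read_as)

if __name__ == "__main__":
    for ch in "!]{}":
        raw = ch.encode("cp500")
        print(repr(ch), "written as cp500 byte", raw.hex().upper(),
              "reads back under cp037 as", repr(raw.decode("cp037")))
```

Braces survive the round trip because both code pages place them at X'C0' and X'D0'; the exclamation mark and the right bracket do not.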
Requirements Because the problems and issues are interrelated, customers demand an integrated solution. The primary objective is to preserve the meaning of character data across SAA systems. This objective expands into four different requirement categories. 1. IBM needs an architecture for character sets and codes in SAA. Many of the end user problems result not from a lack of standards but from too many inconsistent standards. IBM must focus on one EBCDIC code and one ASCII code for the Western European character set. The paper refers to these as "Reference EBCDIC" and "Reference ASCII". IBM must announce its direction so customers can start planning. Implementing these requirements will solve many issues of the first three problem areas. Not implementing them will (a) put IBM at a disadvantage to competitors (like Digital Equipment Corporation) which use the ISO 8859-1 code, (b) will allow the existing proliferation of code inconsistencies to continue, and (c) make solving the problems later much worse. However, merely defining standards in SAA is insufficient. 2. IBM SAA products must exploit the "Reference EBCDIC" and "Reference ASCII" codes. People use computers for applications. Recall that current applications only support specific codes. Therefore, SAA products must use the "Reference EBCDIC" and "Reference ASCII" codes. 3. Installations require help migrating to the "Reference EBCDIC" and "Reference ASCII" codes. The migration period will extend over several years because customers face both IBM and non-IBM software conversions and have inventories of older equipment. The primary concerns are (1) to migrate once, (2) to minimize difficulties during migration, (3) to allow each installation to choose its own migration plan, and (4) to provide tools to assist migration. Implementing migration requirements will help customers rise above the mire of present problems. 4. SHARE must become more involved in Standards issues. 
This is an issue not for IBM but for SHARE. SHARE must influence standards to avoid future problems. This summarizes the SHARE requirements for resolving the problems and issues. They will not be easy to resolve. If they were, customers could have resolved them years ago. Resolution will require difficult decisions for IBM and its customers. Nevertheless, the decisions must be made. Some in IBM believe that nothing need be done now. This is untrue because the problems become worse every day. SAA provides a unique opportunity for IBM and its customers to break with past problems, and make a fresh start. But IBM must act quickly or lose the opportunity. Act now! 30-Mar-89 20:32:22-GMT,6876;000000000411 Return-Path: <@cuvmb.cc.columbia.edu:ISO8859@JHUVM.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA19127; Thu, 30 Mar 89 15:32:15 EST Message-Id: <8903302032.AA19127@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA22424; Thu, 30 Mar 89 15:29:26 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4600; Thu, 30 Mar 89 15:27:44 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 0924; Thu, 30 Mar 89 15:27:43 EST Received: by BITNIC (Mailer X1.25) id 0137; Thu, 30 Mar 89 15:28:38 EST Date: Thu, 30 Mar 89 12:47:24 CST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Michael Sperberg-McQueen Subject: query about overstruck characters in ISO 8859 To: Frank da Cruz Johan van Wingen has pointed out several times in this forum that in ISO 8859, as opposed to ISO 6937, 646, and other earlier coded character sets, it is illegal to use backspaces to overstrike two characters as a method of obtaining a new character. At least, that's what I understood him to say. 
ISO 8859-1 : 1987 (E) says (paragraph 7) "The use of control functions, such as BACKSPACE or CARRIAGE RETURN for the coded representation of composite characters is prohibited by ISO 8859." I have two questions: (1) just what sorts of activities are supposed to be forbidden here? and (2) why? To be more specific: if I need to print a Serbo-Croatian word containing a 'c' with an acute accent, I could probably do any of the following things (depending on my system environment). Which of them are legal, and which illegal? And can we construct a rationale for the legality and illegality of each? (= *should* they be legal?) (a) embed the sequence 'c' BACKSPACE ´. (hex 63 08 B4) in my file (if I'm using an editor that allows me to embed backspace characters, as some do and some don't) and let the printer, the display, and other devices deal with it as best they can. The display will probably show me the acute, and the printer will do an overstrike, unless it's a line printer, in which case I may get a variety of things but almost certainly not what I want. (b) use a Script command like ".dc bs <" and then use the combination 'c<´.' in my file. Script will arrange to have the acute and the 'c' overstruck, either by issuing a backspace or by doing something else. (c) use the same Script command, and also define a Script symbol with ".sr cacute = 'c<´.'" or ".sr cacute = 'c&sysbs.´." and then in my file use "&cacute." instead of "c<´." (d) use some relevant system facility (either in Script or in a microcomputer word processor) to define the width of hex B4 as 0. Then send the sequence hex B4 63 to the printer. 
(e) use the editor or some (imaginary) Script facilities to embed a sequence like ESC '-' 'B' (hex 1B 2D 42) at the beginning of my file to set up ISO 8859-2 as my G1 character set, and then in my file embed SHIFT-IN X'B6' SHIFT-OUT (hex 0F B6 0E) for the acute-accented 'c' (f) embed the ESC '-' 'B' sequence in some way, use Script's symbol facility to define ".sr cacute = &x'0FB60E' " and then use "&cacute." in my file as usual. If I understand the text of paragraph 7, approach (a) is clearly in violation of the spirit and letter of the standard. What about approach (b)? In my file, I'm not using any control characters to create composite characters: only graphics. I don't expect any editor to resolve the multi-character encoding for me and display an accented 'c'. But I am, I admit, using backspace or CR in the printer stream (or if the printer is more sophisticated, maybe something even more devious). Or perhaps I'm not. I don't know what Script97 does with the Xerox 9700; all I know is that the ".sr" command given should give me something resembling the character I want on my output. Approach (c) is much the same as (b), except that a lot of these symbols are already defined at installation. Is it a violation of the standard to use them, if they produce backspaces in the printer data stream? Approach (d) avoids the backspace in the data stream, but probably violates another part of paragraph 7: "None of these characters are non-spacing." Approach (e) and (f) sound as though they are what the standards committee expects us to do. But given that very few pieces of software will handle such escape sequences, I am not sure what paragraph 7 can mean or is supposed to mean for sites, developers, or end users. If I cannot use character 11/4 (acute accent) to form composite characters, why is it there? For use in mathematics to distinguish symbols (K and K' = K-prime)? 
In that case it would be far better to use slots 11/4, 10/8, 11/8, and 10/15 to include Turkish, and define another single character set for all sorts of mathematical symbols. ("Lead us not into temptation.") I imagine the point of paragraph 7 must be to say that extension of the character set to handle things like accented 'c' should be done through the extension techniques defined by other ISO standards, and not by overstriking characters of the ISO 8859 sets. In an ideal world, all the equipment would support ISO 8859-1 through -9, and ISO 2022 and so on. But in the real world -- is it considered a violation of ISO 8859 to use non-standard code extension techniques in order to make non-conforming equipment produce appropriate results? Our printer probably doesn't have a-umlaut as a separate character. Is it a violation of paragraph 7 to write a printer driver that reads character 14/4 from a file and sends an overstrike sequence including BACKSPACE to the printer? Would it be a violation if the printer driver translated from ISO 8859 to ISO 6937? Frankly, I find the blanket prohibition against use of BACKSPACE and CR in paragraph 7 a bit confusing and don't believe I understand the logic behind it. I am involved in a large international project to formulate methods for encoding literary and linguistic data in machine-readable form. It is important that we be able to recommend sound practice for encoding diacritics. To me, that means practice which agrees with relevant standards. But it is also essential that the recommended practice be something that people can actually work with using the software that exists. So I am particularly interested in finding out what the character set committee had in mind when they wrote paragraph 7. 
-Michael Sperberg-McQueen Editor in Chief, ACH / ACL / ALLC Text Encoding Initiative University of Illinois at Chicago 31-Mar-89 2:05:08-GMT,2089;000000000001 Return-Path: <@cuvmb.cc.columbia.edu:ISO8859@JHUVM.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA20280; Thu, 30 Mar 89 21:05:06 EST Message-Id: <8903310205.AA20280@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA19487; Thu, 30 Mar 89 21:01:50 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 4791; Thu, 30 Mar 89 21:00:35 EST Received: from BITNIC.BITNET by CUVMB.CC.COLUMBIA.EDU (Mailer X1.25) with BSMTP id 1560; Thu, 30 Mar 89 21:00:34 EST Received: by BITNIC (Mailer X1.25) id 8444; Thu, 30 Mar 89 21:01:34 EST Date: Thu, 30 Mar 89 18:53:57 EST Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Frank da Cruz Subject: Re: query about overstruck characters in ISO 8859 X-To: ASCII/EBCDIC character set related issues X-Cc: Christine M Gianone To: Frank da Cruz In-Reply-To: Your message of Thu, 30 Mar 89 12:47:24 CST We share your curiosity about the ISO8859 prohibition on composite characters. Not that it doesn't make sense -- ISO 8859 wants a character to be a character, so that it is possible for character and string oriented software to deal with text in a uniform way. Hence ISO 8859 shuns the composite "character building" allowed by ISO 646, and *required* by CCITT T.61. Our curiosity, like yours, is about how mixed-alphabet data is to be stored on disk. This relates closely to an extension to the Kermit file transfer protocol that we're working on, for transferring text in mixed alphabets between unlike systems. If you'd like to read & comment on it, or want to be added to the "isokermit" discussion group, let us know. 
- Christine Gianone and Frank da Cruz 31-Mar-89 11:09:56-GMT,7097;000000000000 Return-Path: <@mitvma.mit.edu:KLENSIN@INFOODS.MIT.EDU> Received: from mitvma.mit.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA21290; Fri, 31 Mar 89 06:09:52 EST Received: from INFOODS.MIT.EDU by mitvma.mit.edu (IBM VM SMTP R1.2) with TCP; Fri, 31 Mar 89 06:09:44 EST Received: by INFOODS id <00002470066@INFOODS.MIT.EDU> ; Fri, 31 Mar 89 06:01:57 EST Date: Fri, 31 Mar 89 05:24:36 EST From: John C Klensin Subject: Overstruck characters and 8859 To: Frank da Cruz <@mitvma.mit.edu:fdc@watsun.cc.columbia.edu> X-Vms-Mail-To: EXOS%"Frank da Cruz <@mitvma:fdc@watsun.cc.columbia.edu>" Message-Id: <890331052436.00002470066@INFOODS.MIT.EDU> Frank, First of all, if you have an isokermit list going, please add me to it. Maybe, even though the newsletters seem to have stopped getting through to here, and info-kermit-request respondeth not, I can get that. Klensin@INFOODS.MIT.EDU I'm waiting until I have a chance to study the responses to the original question for a bit before I put together a response of my own (which, by then, may not be necessary) but let me provide a piece of the answer from a standards-policy viewpoint. One of the big problems with this evolving standards stuff is a global lack of coordination. We are at a sufficiently primitive point that "coordination" means "telling other people what you are doing", and we are not doing very well at that. ANSI has just initiated its third--in about as many years--attempt at a system for one standards developer notifying others when new projects are initiated. The other two fizzled out into nothing in short order and, in at least some respects, the ISO/IEC/CCITT situation is worse. 
Now, against that backdrop, ISO/IEC JTC1/SC2 and its ANSI/X3 equivalent ought to be forced to (a) make clear statements about what each of these character set standards is *for* and how each relates to, and can be translated to and from, any of the others and (b) understand that more alternatives is often a vice, not a virtue. Otherwise, they are headed, and heading us, rapidly down the path that IEEE 802 seemed to be going down for a while: you can "standardize" any network physical and link level technology you like, as long as you can write a clear specification. Better than not writing a clear specification, I suppose. SC22 (ISO programming languages) has finally (after umpity years) established a strong liaison with SC2 and is beginning to say "look guys, some of these things are impossibly difficult in use, and there are some things you have to specify". It is not clear that will cure the problem. Anyway, CCITT's traditional goal has been clear--to transmit the maximum number of character representations down a communications line, with minimum switching around, and a minimum requirement for really fancy hardware at the far end. Hence a lot of overstrike logic. *Some* of the SC2 standards follow that tack, and are standards for transmission of characters over communications links. But, if you are trying to do a programming language system--especially if you are trying to compare, catenate, or overlay character strings--variable-length logical characters (which is what a graphic BS graphic amounts to) are a pain in the neck to deal with. Even the definition of the length of a string gets funny when length-in-"bytes" is not equal to length-in-logical-characters. So 8859 comes along, and, with good intentions and for good reasons, they say, or try to say, "no composite characters". And, of course, someone comes along and says "but I want to have composite characters, how do I do it?" In the 8859 world, you don't. 
You have a code point and, at any point, you need to know which 8859 element that code point is to be interpreted with respect to. That combination of character set and code point--I know of one experimental implementation in progress that simply canonicalizes all of the switching into and out of 8859 sets into representing each "character" internally with two octets, the 8859/n set and the code--gives a unique, testable character, under a rule that two character sets means two different characters, even if the graphics are the same. If you want a rule that says "if the graphics and/or character names are the same, the characters are the same", then you need a further canonicalizer that prefers, for example, low-numbered 8859 sets to high-numbered ones. And dealing with multiple sets requires very high tech devices, which can understand all of them and, presumably, bit map characters onto the screen. 'Taint a $400 terminal. The kermit meta-question depends on what you are trying to do, and what needs you are trying to solve and, to partially repeat what I've said earlier, the needs and requirements are different enough that I'd get out my ten foot pole and use it to define a boundary between "data transfer" and a lot of very complex data transformation issues. Let me suggest a nasty analogy. Plus or minus a certain amount of precision loss, it is possible to convert any floating point number representation into any other. I don't much favor the idea, but it would be possible to invent a way of defining floating point formats, and to define a "kermit-standard" floating point. You could then fix up an attribute packet that would say "this here file is completely in kermit-standard floating point" and expect that kermits at both ends would convert between local representations and that format. 
Problem is that either it would work only for files that contained nothing but floating point numbers, or you would have to invent a mechanism for flagging which values were floating point and which were something else. The number of "pure" floating point files drops each year, especially since people want to transmit, e.g., array dimensionality, with their data files. And, right after you headed down that slippery slope, we would be talking about a general kermit self-describing file. I would think about this as a way to describe the "thing" that is being transmitted--an atomic file, if you will. "thing" descriptions are pretty simple: 646Text. You-better-not-mess-with-this-"binary". 8859-1Text. 8859-nText, where "n" is another attribute. Now, there is nothing wrong with T.61Text as a "thing", as long as no one has delusions about conversions between graphic stylizations associated with T.61 and characters associated with 8859-n being performed automatically, especially in poor, helpless, kermit programs as distinct from converters with lots of user-adaptable tables and heuristics of their own. T.61, if I recall, lacks even the elementary required canonicalization rules that made string compares work on Multics (those are, effectively, designed around the "if it looks the same, it is the same" principle, something that 8859 implicitly disavows). john Identification of hat being worn as this is written: Chairman, ACM Standards Committee; Member, ANSI/ISSB. 
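[Editorial illustration, not part of the message.] The two-octet internal representation mentioned above (the 8859/n set plus the code) and the optional "further canonicalizer" that prefers low-numbered sets can be sketched as follows. This is only a sketch under stated assumptions: it is not the experimental implementation the message refers to, the function names are hypothetical, and it uses Python's `iso8859_N` codecs with Unicode identity standing in for "the graphics and/or character names are the same".

```python
# Sketch: a character is a (part, code) pair, where part is the n in
# ISO 8859-n and code is the octet. Names here are hypothetical.

def same_strict(a, b):
    """Rule 1: two character sets means two different characters."""
    return a == b

def canonical(pair, known_parts=(1, 2)):
    """Rule 2: prefer the lowest-numbered 8859 part containing the character.

    Unicode identity is used as a stand-in for 'same character name'
    (an assumption of this sketch).
    """
    part, code = pair
    ch = bytes([code]).decode("iso8859_%d" % part)
    for p in sorted(known_parts):
        try:
            return (p, ch.encode("iso8859_%d" % p)[0])
        except UnicodeEncodeError:
            continue  # character not present in this part
    return pair

if __name__ == "__main__":
    e_acute_latin2 = (2, 0xE9)   # e-acute as coded in ISO 8859-2
    e_acute_latin1 = (1, 0xE9)   # e-acute as coded in ISO 8859-1
    print(same_strict(e_acute_latin2, e_acute_latin1))
    print(canonical(e_acute_latin2) == canonical(e_acute_latin1))
```

Under the strict rule the two e-acutes differ; after canonicalization they compare equal, while a character that exists only in ISO 8859-2 keeps its (2, code) pair.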
Klensin@INFOODS.MIT.EDU 31-Mar-89 16:05:07-GMT,37228;000000000001 Return-Path: Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA21708; Fri, 31 Mar 89 11:05:03 EST Message-Id: <8903311605.AA21708@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA27286; Fri, 31 Mar 89 11:01:38 EST Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 5027; Fri, 31 Mar 89 11:00:28 EST Received: by CUVMB (Mailer X1.25) id 2511; Fri, 31 Mar 89 11:00:25 EST Date: 03/31 10:11:34 From: FDCCU@cuvmb.cc.columbia.edu Subject: PUN file from RSCS - MOSGLA.MAIL X-Tag: FILE (4053) ORIGIN HLERUL2 MAILER 3/31/89 5:15:33 E.S.T. To: fdc@cunixc.cc.columbia.edu Reply-To: MAILER%HLERUL2@cuvmb.cc.columbia.edu Date: Fri, 31 Mar 89 16:57 CET From: "Johan van Wingen" To: "M. Sperberg-McQueen" , "F. da Cruz" , "E. Hart" Subject: overstruck characters Dear Character Overstrikers By way of attempt to convince you that there are good reasons for prohibiting composite characters in ISO 8859 I send here the revised version of my ISO paper (Ed, you have seen the first version). It are 670 lines. 
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ISO/IEC JTC1/SC2 N 1961R
ISO/IEC JTC1/SC22 N 578R
September 1988, Revised April 1989
VERSION 1.2

CODED CHARACTER SETS AND PROGRAMMING LANGUAGES

Johan W van Wingen
Leiden, the Netherlands
Personal contribution

Table of Contents

0 Introduction
  0.1 The Problem
  0.2 Terminology, notations and conventions
1 Coded Character Sets
  1.1 The birth of ASCII
  1.2 Extension of the character set
  1.3 Composite characters
  1.4 Multiple-byte character sets
2 Languages
  2.1 Computer data processing
  2.2 Operating system considerations
  2.3 Basic elements of the language
  2.4 Problems of character representation
  2.5 Non-English languages and Information processing
    2.5.1 Linguistic skeleton of the language
    2.5.2 Identifiers
    2.5.3 Comments
    2.5.4 Handling textual data in the program
      2.5.4.1 Unrestricted strings
      2.5.4.2 Restrictions on string content and their validation
      2.5.4.3 The type "character"
3 Sorting considerations
4 Conclusions
  4.1 Recommendations to SC22
  4.2 Recommendations to SC2
  4.3 Unsolved issues
Annexes

Entia non sunt multiplicanda praeter necessitatem. (Entities are not to be multiplied beyond necessity.) William of Occam

0 INTRODUCTION

0.1 The problem

In recent years there has been an increasing demand for computer facilities that do not need the English language for their expression. In the field of International Standards this affects in the first place the work of ISO/IEC JTC1/SC2, Characters and Information Coding, because this committee develops the elementary tools for expressing everything dependent on language. SC22, Languages (for Information Processing) is one of the important users of these tools, and at the same time the primary target for requirements from non-English speakers. At its 1987 Washington meeting two resolutions were adopted, that formulated the principles of a future policy (see SC22 N 406, Resolutions 85 and 86). 
Up to now several papers have been produced on the subject, (SC22 N 113,357,410,403,410,444,460,470,509, SC22/WG10 N 130,204,208,211,213, 214), a number of them by the SC2/SC22 Liaison, Mr. Holka. These showed to SC22 that the SC2 matter is far from simple, and difficult to explain. In a reaction, on N 410 in particular, the Convener of SC2/WG3 complained of inaccuracies, of the use of a non-standard terminology, and of a general ignorance of the aspects of the work of SC2 (N 509). To resolve the issues he suggested a joint meeting of SC2 and SC22 delegates, which idea is to be acclaimed. The present paper is intended as a first contribution to the working documents for that meeting, and as a renewed attempt at illustrating the relations between the SC22 and SC2 products in a clear way, while acknowledging the valuable ideas and suggestions from Mr. Holka. This paper does not express any opinion of the Netherlands Member Body (NNI), not from any disagreement on the content, but because taking any position is considered premature at the moment. 0.2 Terminology, notations and conventions The terminology in this paper is that of the ISO standards in the field. The terms "bit pattern", "bit combination", "byte" are used almost as synonyms, "bit string" is not used. "Byte" is not restricted to 8-bit combinations. For those, "octet" is used instead. Bytes are denoted with the customary hexadecimal representation, but incidentally also according to the ISO convention (15/15 for FF). Where clear from the context, "character" means "graphic character". All graphic characters that are not letters or digits are called "specials". The terms "control character" and "control function" are used as defined in ISO 2022. Where "language" is used, it is in the sense of the SC22 scope, unless it can be derived from the context that "natural language" is meant. 1 1 CODED CHARACTER SETS 1.1 The birth of ASCII The idea of coding data is rather old. 
For several purposes it appeared necessary to represent texts or numbers in a form other than spoken or written. The Morse code was an important step in a long development, as was the Hollerith punched card. The idea of having holes as a unit of information, the bit, was very fruitful, and could be generalized for use on electronic media. As early as 1931 the 5-bit TELEX code (CCITT # 2) was adopted, introducing the concept of bit pattern, or bit combination. As main areas of application of representing data with bit patterns emerged in the course of time: 1. Storage of data. Numerical results of the census could be stored in punched cards and manipulated in a simple way. Sorting in particular became easy to do. 2. Transmission of data. Texts could be transferred by telex in an easier way than was possible by Morse code. 3. Processing data by a computer. When computers were developed, bit patterns played an essential role. Storage and registers were organized in "machine words", bit patterns of fixed length. Most popular were 24,32,36,48,60,64. Increasing use of electronic methods necessitated the adoption of standards, which had to serve the areas of application where data interchange was of primary importance. Thus ASCII, a 7-bit code |(characters mapped on 7-bit patterns) saw the light in 1963. An excellent description of the developments leading up to ASCII is found in the paper by Bemer (1) and the book by Mackenzie (2). ASCII provided codes (assigned bit combinations) for 94 graphic characters (26 letters, 52 after 1968, 10 digits and 32 specials), the SPACE and 33 control characters for control functions. The code table is in FIG 1. The control characters are in columns 0 and 1, the capital letters in 4 and 5, small letters (after 1968) in 6 and 7, digits in 3, SPACE at position 2/0, DELETE at 7/15, specials in the positions left over. ASCII was designed by its structure to serve the first two application areas well. 
-- By assigning to letters bit patterns in ascending order without gaps, a contiguous "collating sequence" could be defined, easily implementable on an electronic device. (The old telex code did not possess this property.) -- By providing codes for control functions and making them easily recognizable by putting them together in two columns of the code tables, ASCII was well suited for transmission of data, text in particular. For internal processing by a computer ASCII was not very well adapted. A 7-bit machine word is hardly usable. For internal representation of codes 6-bit or 8-bit "bytes" were much better, as 6-bit bytes could be contained in a 24, 36, 48 or 60-bit machine word 4, 6, 8 or 10 times, or an 8-bit byte in a 32 or 64-bit word 4 or 8 times. Only DEC succeeded in putting 5 ASCII characters into a 36-bit word. It is no surprise that many computer manufacturers defined their own 6 or 8-bit coded character sets for their specific machine use. Particularly influential became EBCDIC from IBM (FIG 1). ASCII has another important property (not present in the old TELEX code). Every character of the set has a unique code, and every bit combination has a unique meaning. The presence of 8-bit bytes in a computer poses a new problem. If we want to transfer collections of these outside the computer ASCII does not provide facilities. We may define certain 8-bit combinations as being equivalent to ASCII codes, but even then we are faced with the fact that there are 128 left without a clear meaning. For the interpretation of these we would need what we could call a Standard for Charactered Code Sets. In other words, a standard (as it is now) specifies a mapping of every character it includes to a single byte. What we want is that it also says how every byte shall be interpreted as a character. The essence of data communication by transmitting characters coded in deviation from ASCII is "a previous agreement between sender and recipient of the data". 
The problem with transferring computer data from and to storage is that it is mostly not clear who sender and recipient are. Not only should coding be defined, but also interpretation. A statement that certain "bit combinations shall not be used" is as sensible as saying "a program shall not contain errors". It is the task of the programmer to code his program correctly, but of the compiler to interpret the sequence of bytes, without stopping at the first unrepresented byte. 1.2 Extension of the character set It was clear from the start that ASCII deserved an international status such as could be achieved under the responsibility of ISO. Because countries other than the US have different requirements to the contents of the coded character set the approved document ISO R 646 contains options for a number of positions in the code table. Once exercised, the result is a National Version, ASCII being the US National Version. Unfortunately this implied that the principle of unique code-character correspondence was abandoned. With the rules of ISO R 646-1968 (revised in 1973 and 1983) it became possible to code texts in Danish or Swedish, which carry a 29 letter alphabet, at the price of losing 6 specials. Further needs such as accented letters (for French) or additional specials could not be satisfied. To this purpose an extension scheme was devised, standardized as ISO 2022. The idea is that different characters may be coded with the same bit combination. To indicate which character is meant, a control function SHIFT is inserted (several are defined) or a ESCAPE sequence with analogous effects. At reading (or receiving), each time a SHIFT or ESCAPE sequence is detected, the "state" of the reader changes, and a different code table is accessed. (Possible code tables may be a national version of ISO 646, or one registered by the Registration authority.) 
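[Editorial illustration, not part of the paper.] The state change described above, where a SHIFT switches the reader to a different code table so that one bit combination can stand for different characters, can be sketched as a very small state machine. The sketch assumes only the two shift controls SHIFT OUT (X'0E') and SHIFT IN (X'0F'); ESCAPE sequences are not modelled, and the G1 table below is a made-up toy table, not a registered set.

```python
# Sketch of reading shifted data: the decoder's state is "which table is
# currently invoked". G1 here is a hypothetical three-entry table.

SO, SI = 0x0E, 0x0F   # SHIFT OUT invokes G1, SHIFT IN returns to G0

G0 = {code: chr(code) for code in range(0x20, 0x7F)}   # plain ASCII graphics
G1 = {0x61: "à", 0x65: "é", 0x75: "ù"}                 # toy table, same codes

def decode_shifted(data, g0=G0, g1=G1):
    """Decode bytes, switching code tables whenever a shift is detected."""
    table = g0
    out = []
    for code in data:
        if code == SO:
            table = g1
        elif code == SI:
            table = g0
        else:
            # A combination with no assignment in the current table is
            # shown as U+FFFD rather than stopping the reader.
            out.append(table.get(code, "\ufffd"))
    return "".join(out)

if __name__ == "__main__":
    print(decode_shifted(b"caf\x0ee\x0f"))
```

The byte X'65' is read as "e" or as "é" depending only on the shift state, which is why a reader that loses track of the state misreads everything that follows.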
ISO 2022 provides the means for coding an almost unlimited number of characters by a single, but not unique, bit combination. It is not restricted to 7 bits, but was later extended to include 8-bit coded character sets, as soon as the structure of these was defined in ISO 4873. Because reading data encoded according to ISO 2022 requires a finite state machine with very many states, practical use has never been extensive. With the advent of hardware with 8-bit facilities, partial solutions for the more urgent problems became feasible. Nevertheless, ISO 2022 supplies the general method in all cases where switching of code tables is unavoidable. Even for multiple-byte coded sets rules are defined.

ISO 4873 specifies the structure of 8-bit coded character sets, but does not define a single code table. It fixes the content of some areas, but for the rest only options are given. For the purpose of ISO 2022, sets are identified with a set designation C0, C1, G0, G1. Control characters occupy columns 0, 1 (C0) and 8, 9 (C1). Columns 2-7 (G0) are identical with those of ISO 646 (including its options). For columns 10-15 (G1), options for 94 or 96 characters are specified. Thus ISO 4873 is only a generic standard, providing for 188 or 190 graphic characters.

1.3 Composite characters

In order to restrict the complexities of coding by the ISO 2022 method, especially where hardware does not allow midstream code table switching, other approaches for extending the available number of graphic characters were recommended. Some characters can be represented by combinations of several other characters. Following the practice of overprinting, ISO 646 allows creation of composite graphic characters by the use of BACKSPACE and/or CARRIAGE RETURN. But it warns (on p. 7): "According to clause 5 it is permitted to use composite graphic characters and there is no limit to their number. Because of this freedom, their processing and imaging may cause difficulties at the receiving end.
Therefore agreement between sender and recipient is recommended if composite characters are used."

To avoid this pitfall, ISO 6937 follows a different approach. There are simple and composite graphic characters. Several characters are coded with a single bit combination (digits, specials, letters of the Latin alphabet and some additional ones). Others are coded by a double one: the first representing a diacritical mark (non-spacing), the second a Latin letter (spacing). Arbitrary composite graphic characters are not allowed. The number of graphic characters defined by ISO 6937-2 is restricted to those occurring in a "repertoire". Equally, not all "duples" are permitted, only those included in the repertoire. It is assumed that these duples can be displayed by hardware (the "character imaging device") as one single graphic symbol. ISO 6937-2 defines a "primary" (G0) and a "secondary" (G1) set, which can be combined to form the graphic character part of an 8-bit code (popularly, the left and the right hand of the table). In this way a unique, but mixed single/double octet representation of characters is created. All European languages and several others can thus be represented.

ISO 8859 was developed for presenting a unique single octet representation of graphic characters. Because not all characters that are desired can be accommodated in a 94+96 code table, ISO 8859 is in several parts, each defined for a particular region of the world, serving the need of groups of languages (Europe: West, East, North, South; Cyrillic, Greek, Arabic, Hebrew). Each part contains the 94 (G0) from ASCII as a subset, supplemented by a varying 96 (G1) set (FIG 2; IBM followed by extending its EBCDIC to contain the same 190 graphic characters, FIG 3). Thus the code of ISO 8859 is only unique in a restricted sense. Where graphics from different regions are to be combined in a text, switching techniques from ISO 2022 are required.
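The restricted uniqueness of ISO 8859 can be made concrete with a small Python sketch. It relies on Python's built-in codecs for parts 1, 5 and 7 (an assumption about the reader's tooling, not something used in this paper) and reads the same octet under each part.

```python
# One octet from the G1 half of the table, read under three parts of ISO 8859.
octet = bytes([0xE1])
parts = ("iso8859_1", "iso8859_5", "iso8859_7")   # Latin 1, Cyrillic, Greek
readings = {part: octet.decode(part) for part in parts}

# The G0 half is common to all parts, so an ASCII letter reads the same everywhere.
ascii_readings = {part: b"A".decode(part) for part in parts}

for part in parts:
    print(part, hex(ord(readings[part])), ascii_readings[part])
```

Under these codecs the octet E1 is a Latin small a with acute in part 1, a Cyrillic small es in part 5 and a Greek small alpha in part 7, while the octet for A is the same letter in all three. The octet alone does not identify the character; the part in force has to be known from elsewhere.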
1.4 Multiple-byte coded character sets

Multiple-byte character sets have attracted a lot of attention in recent times. From this it might seem that it is a clearly defined concept. But it is not; it is not even a new one. Four schemes have emerged so far:

--- That of ISO 646. Characters may be represented by 1, 3, 5 or more bytes, by use of sequences such as "char" BACKSPACE "char", and so on.
--- That of ISO 6937-2. Graphic characters may be formed from diacritic plus letter, giving a mixed single/double byte representation.
--- The scheme used by the Chinese, Japanese and Korean national standards. A graphic character is represented by two bytes, each taken from a 7-bit set with 94 positions, the same as is used for the graphic characters in ASCII.
--- A standard in development by SC2/WG2, to which the number DP 10646 has now been assigned. All imaginable characters of the whole world (except cuneiform and hieroglyphs) are uniquely represented with 4 bytes per character. This is the price to be paid for doing without ISO 2022.

Besides these four, several schemes have been invented and used in Japan and China that mix single and double byte character representations, in a bewildering variation.

2. LANGUAGES

2.1 Computer data processing

Computer data processing consists, in its simplest form, of a program operating on data. Data has to be represented in bit patterns, that is, in machine words, or parts thereof that can be addressed by the hardware organization. Some of these parts may be considered as a character, or better yet, as the internal representation of a single character. Mostly, the bit patterns of these representations are not identical to those found in any ISO standard.

A program is written in a language. It may exist written on paper, or even in the mind of the programmer. But as soon as it is prepared for input into the computer it consists of a sequence of character representations, perhaps divided into lines by some device.
Once stored in the computer, there is no intrinsic difference between a program and data. Both need representation. After the program has been compiled to an executable form, its representation has changed, but is still expressed in the bit pattern of the machine. What makes data a (potential) program is the place in storage, a matter of interpretation by the computer.

2.2 Operating system considerations

Computers, at present, do not work by single programs only. It is the Operating System that, as one of its tasks, performs the data management. It assigns meaning to some data, and decides on the shape of output or on the validity of input, quite often outside the control of the program that is supposed to handle these. In fact, speaking in transmission terms, it acts both as "sender and recipient of the data". Because of that, it generally stores the description of the nature of the data outside the data itself, contrary to the practice in transmission ("announcers" being part of the data stream).

Another aspect of the data management is the division of data into "records" (and sometimes "blocks"). This may be a different concept from that of "line", such as is defined by the language of the application program.

2.3 Basic elements of languages

A program is, according to the language definition, always built from basic elements. They may be called "characters" or "basic symbols", but they are essentially of an immaterial nature. They may look like letters or mathematical symbols, but sometimes they may never be used outside that particular programming language, like some APL characters. To be suitable for input into a computer they must be transliterated into sequences (or perhaps lines of sequences) of those characters the computer can represent. The transliteration rules, from the abstract character (used in the language definition) to the concrete character of the machine, are traditionally called "hardware representation".
The concrete character sequences on the input medium are read in by the computer, and processed by the language compiler. This external representation of the basic element will then be converted again, now to the internal representation used by the compiler, often an integral value.

The importance of this step is sometimes ignored by defining the character set of the language (if there is one thus called, otherwise the list of all the symbols needed for expressing a program in the language) as being identical to that from an existing standard for coded character sets, without any indication of possible extension. At this point the restriction of the elements to English expression creeps in. Some designers of a language have even been so silly as to use brackets and braces, without substitutes, not realizing that this precludes its use in Scandinavia, where these characters are replaced by letters from the extended alphabet in use there.

2.4 Problems of character representation

Even with a single byte character representation, coded programs generally have to be translated into the character set of the actual computer at input. This may be only a one-to-one process. But if characters are represented by a varying number of bytes, things grow worse. This is aggravated when the hardware representation is not unique. APL uses graphic characters not found in any ISO standard. These can be produced as composite characters using BACKSPACE (in the following abbreviated to $). Now, if we want an underlined capital A, we can write A$_ or _$A (with ISO 646), or _A (with ISO 6937-2). Which of these is acceptable is a matter of hardware representation. One may do, or both, or several others.

In ALGOL 60 the underlined word end is a single basic element, which can be written as e$_n$_d$_ or _$e_$n_$d, but also as end$$$___ or ___$$$end, with a surprising number of other combinations possible (ISO 6937-2 allows only _e_n_d, which is an improvement).
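That these spellings all denote the same printed image can be checked by simulating a print head. The Python sketch below is an illustration only, not the method of any particular compiler; it takes $ as BACKSPACE, as in the text above, and records which graphics have been struck in each column.

```python
BACKSPACE = "$"   # the abbreviation for BACKSPACE used in the text

def reconstruct(line):
    """Return, for each print column, the set of graphics struck there."""
    cells = []
    col = 0
    for ch in line:
        if ch == BACKSPACE:
            col = max(col - 1, 0)     # move the print head one column back
            continue
        while len(cells) <= col:
            cells.append(set())
        cells[col].add(ch)            # overstrike: add to what is already there
        col += 1
    return [frozenset(cell) for cell in cells]

spellings = ["e$_n$_d$_", "_$e_$n_$d", "end$$$___", "___$$$end"]
images = [reconstruct(s) for s in spellings]
all_equal = all(image == images[0] for image in images)
```

All four spellings reduce to the same three columns, each holding a letter plus a low line. A compiler that accepts every such spelling has to perform this kind of reduction on each input line before it can recognize the basic symbol.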
If all of this is permitted we have created the "line reconstruction problem", which was solved in the ALGOL 60 compiler for the Ferranti ATLAS (4). Few people are prepared nowadays to accept these complexities. Should an ISO 2022 style of coding be permitted, including code table switching in the middle of program lines, then designing an adequate hardware representation scheme requires real genius.

There is another point to consider. In certain parts of a program literal use of characters is needed (in strings for example). They have to be stored and handled as such by the compiler, and thus an internal representation is required, in contrast with the external representation in which they are being read in. If they are always coded with the same number of bytes, both representations can be made identical. Otherwise many implementers may have to resort to an internal mapping on integers. Then it would be difficult to use the hardware of the present day octet-machines efficiently. If integer arithmetic is needed again for handling characters, the clock has just been put back to the situation of 25 years ago, when Fortran and Algol provided only this method for character processing.

The conclusion must be drawn that, other than in exceptional situations, only coded character sets that are unmixed and unshifted, not permitting the use of BACKSPACE, are acceptable for coding a program text. Otherwise, strict and perhaps complex rules are required to ensure a unique hardware representation. These sets are ASCII, and the single parts of ISO 8859, that is ISO 4873 without shift (called Level 1). A consistent double-octet scheme, such as found in the "west-pacific" standards, may also be considered.

2.5 Non-English languages and Information Processing

Traditionally, the Information Processing world is English speaking only.
Now that the access to this world is no longer reserved to an intellectual elite, this practice has become an untenable barrier to large groups of people. The extent of the problems has been excellently summarized in the SEAS White Paper on National Language support (3). With programming languages we distinguish four areas requiring attention.

2.5.1 Linguistic skeleton of the language

Almost every language has a number of elements looking like words from the English language. Some have been assigned a fixed role, some may be chosen freely. Those that are fixed, together with some specials (brackets, separators, delimiters), constitute the linguistic skeleton of every program. According to the specific language definition, they may be called "word symbols", "reserved words" or "keywords". They are supposed to show a program as a running text in English. It is clear that it is not generally possible to translate every single word into another natural language without violating its syntactic rules. Some statements may even become ambiguous. Language definitions explicitly containing provision for expressing programs in languages other than English are scarce, that of ALGOL 68 being the most notable. The ALGOL 68 Report has been translated successfully into Bulgarian and Chinese (5). But in general it may be advisable to keep the word symbols as they are in English.

2.5.2 Identifiers

For naming quantities, variables, labels and so on, "identifiers" are commonly used. These word-like constructs are mostly defined as starting with a letter, and continuing with letters, digits and, sometimes, the low line. The problem lies in the definition of "letter". In the beginning only capital letters were allowed, but even after adding 26 small ones the Scandinavian languages cannot be served. As compilers to a large extent are US products (or written elsewhere with an eye on the US market) they use ASCII or EBCDIC.
This means that characters from a national version of ISO 646 are interpreted as specials (brackets etc.). Applying an 8-bit code like one from ISO 8859 is the way out. But even then, the compiler has to be told which part of 8859 the program makes use of, because what may be a letter in one part may be a special in another (FIG 4). Mixing characters from several parts of ISO 8859 requires invoking the help of ISO 2022, which a compiler writer would not like.

Checking whether a byte is meant to be a letter would be easier if the letter areas of ISO 8859 had been contiguous. Instead of that, quite obsolete characters for multiply and divide, for which * and / have been used in programs for more than 25 years, have been inserted in the middle of a column. A look-up table is required to decide whether a character is considered a letter or not. This cannot be avoided anyway if a double-octet representation is used.

2.5.3 Comments

In general a comment does not present problems when containing any byte, if it is only clear where it stops (or begins). If a ";" (semi-colon) is defined for stopping, and the hardware representation transliterates this as ".,", unintended effects may occur, especially if spaces are ignored. Besides this, it does not matter which characters the bytes represent, even if some of them cannot be displayed properly.

2.5.4 Handling textual data in the program

In order to produce output of text, means to handle its elements must be provided. The usual method is a "string", also called "text constant" or "text literal". A string need not consist of single characters; it may even be nested, in some languages. If we take the simplest form, it remains to be specified what a string can contain: "graphic characters", or "anything that is allowed by the processor", say "bytes".

-- If "graphic characters", some may be represented by a single byte, some by a double.
There may be a SHIFT character in it, causing a change of character set from capital to small letters, or to a different national alphabet. All this can be implemented, and has been, as early as 1962, in the Dijkstra ALGOL 60 compiler for the X1. However, problems arise when operations on strings are introduced, not provided in the definition of ALGOL 60.
-- If "anything", bytes may be all 6-bit, or all 8-bit, as with the 48-bit word Burroughs computer, where a word may contain 8 6-bit bytes, or 6 8-bit bytes, necessitating the inclusion in a program of a more precise description of the kind of string that is meant, with a type indicator.

Only if the program is coded according to a standard, that is, by defining a unique relationship between all bytes and all characters, does it become possible to have characters and bytes indiscriminately as elements of a string, and to introduce counting the number of those elements without ambiguity. Internal and external representation can be made identical. If not, the character count can be different from the byte count, and string parsing is necessary.

There remains the question, even with a single octet character representation such as that from ISO 8859, what to do with a byte that is not "printable", because there is no graphic character defined for it. It should be remembered, however, that actual printing is outside the control of the application program, and left to the supervision of the operating system, which may process options as to the selection of a printing font, or a coded character set. This may result in something quite different from that in which the program originally was coded (FIG 5, 6). FIG 5 shows a little program (in SNOBOL) that converts Greek text in Latin transliteration to one in single Greek letters (even with diacritics). It is printed with the normal printing font which does not include these Greek letters, which are thus invisible (though present) in the result string (HEXALL).
But exactly the same program can also be printed (FIG 6) with a Greek font corresponding to a Greek character set, which shows the contents of the string clearly, but the identifiers all in Greek. To the compiler it does not make a difference, because all the bytes are the same. In FIG 7 it is demonstrated that if a proportional script is used, a table layout may be completely obscured. In that style a program cannot be understood from its printout only.

It is the task of the operating system as well to deal with the control characters or sequences in the text. There are two aspects to be envisaged: presenting the program text for inspection and understanding, and specifying certain actions to be performed by the output.

As ISO 6429 specifies, the control functions indicated may be disabled, and interpreted as graphic characters, by changing the "mode" of a device at detecting a specific control function in the data stream, and restoring the action by another. ISO 2047 specifies graphic representations for the control characters of ISO 646. Thus every byte of a program can be shown, if only the ISO standards have been implemented. Modern terminals like the DEC VT340 allow setting the "mode" at will, and then display on the screen the text according to the option chosen.

As for specifying actions in a program regarding the output, the desired effect may be realized either by putting a certain character sequence in an output string including control functions, or by calling a library function. In fact, both methods should be available, because neither can cover all situations. There will never be enough library functions to create the effects to be caused by a specific byte sequence. On the other hand, for example, detection of a NEW LINE character by the operating system may not result in the effect intended.
Requiring transfer to a new output line by CALL NEWLINE may cause, in a fixed length record environment, the storing of the current line, padded at the right by spaces up to the required length. In a variable length record environment, it may put the number of bytes in the current line before it, and store the whole. Or it may simply store the current line with a new line character attached at the end. (The use of the first byte of a record for printer control is not considered here for simplicity.) All this should be kept outside the control of the application program, and thus not defined by the programming language.

2.5.4.1 Unrestricted strings

If these problems have been sorted out, and a string is to contain octets only, without any restriction, regardless of their meaning in any code, operations on strings do not pose difficulties. A type "string" can be introduced, with string constants and string variables. A LENGTH function can be provided, substrings can be defined, starting from a given element number to another, and concatenation is possible. The string can behave like an octet array.

Because any stream of octets can be produced, files can be prepared and sent, which can be read by the recipient according to the rules of ISO 6937-2, or even ISO 2022. No special provision is required for double-octet character sets, if these are carefully designed like the Chinese, Japanese and Korean sets. The string 'STANDARD' will then be printed without hesitation by a Japanese printer as four Japanese characters (of course with a quite different meaning).

2.5.4.2 Restrictions on content of strings and their validation

It is only if certain restrictions are put on the contents of strings that things become complicated again. This may happen if it is required that a certain string have an even length, because double-byte characters will be put into it.
Also, restrictions may be introduced regarding the validity of some octets (making illegal those that are not "printable"). It is imaginable to define string types that have the desired property, and have them checked for syntax by the compiler. But library functions can perform the checking for validity on string arguments equally well, without further complicating the string syntax of the language.

2.5.4.3 The type "character"

Several languages know a type "character" for a single byte string (the CHARACTER type in FORTRAN is in fact a string type). Longer strings may be defined as character arrays. A function ORD can be defined with a string argument, giving an integer value, the byte converted to decimal. Conversely, a function CHAR with integer argument may deliver a character. This scheme presupposes a single byte and unique character representation. A double-byte but unique code requires an appropriate new type, and new functions, but no special tricks. All others cause a mess.

3 SORTING CONSIDERATIONS

The topic of sorting belongs only partially to the subject of programming languages. It is only because some of them know the concept of "collating sequence" that it is dealt with here. Historically, it was of utmost importance that numbers could be sorted on the basis of their bit representation. Also, in the period of capital letters only, putting words into alphabetic sequence could be performed based on a collating sequence defined by the code table. But when requirements became more subtle (as with a telephone directory or a lexicon) bit patterns were only of little help. Thus the merit of having letters contiguously in a code table has become increasingly insignificant. Non-English languages may even have a varied order for the same accented letters, or assign letter combinations an unexpected place in their alphabet. Further discussion can be found in Mackenzie (2) and in the SEAS White Paper (3). One aspect should be pointed out, however.
Many compilers are able to produce an identifier list from the program, alphabetically sorted. If identifiers are to contain characters other than those from ASCII, it is unclear how these are to be sorted. It may be that the order from the chosen part of ISO 8859 is kept, but that may not correspond to national usage. But one should not confuse "ordering" with "sorting". Only with one-case Latin letters can sorting be directly derived from the ordering according to the numeric value of the codes; in all other cases a key transformation is needed. If some identifiers are to be read from right to left (Hebrew, Arabic) more problems turn up.

4 CONCLUSIONS

Many of the comments on the non-English or multiple-octet issue one finds in the literature (even in ISO documents) are too imprecise or too incomplete to be really of use to the language standard developer. In the preceding lines an attempt has been made to clarify the issues which depend on coded character sets. The actual work of providing solutions has to be done by the SC22 Working Groups themselves. Nevertheless, some recommendations may be given for consideration either by SC22, or by SC2, or by both.

< recommendations still under revision, will be released later >

Annexes are not included, for reasons of space and because of their special characters.
29-May-89 16:49:25-GMT,2703;000000000001 Return-Path: <@cuvmb.cc.columbia.edu:ISO8859@JHUVM.BITNET> Received: from cunixc.cc.columbia.edu by watsun.cc.columbia.edu (4.0/SMI-4.0) id AA26983; Mon, 29 May 89 12:49:23 EDT Message-Id: <8905291649.AA26983@watsun.cc.columbia.edu> Received: from CUVMB.COLUMBIA.EDU (cuvmb.cc.columbia.edu) by cunixc.cc.columbia.edu (5.54/5.10) id AA11207; Mon, 29 May 89 12:48:37 EDT Received: from CUVMB.CC.COLUMBIA.EDU by CUVMB.COLUMBIA.EDU (IBM VM SMTP R1.2) with BSMTP id 0097; Mon, 29 May 89 12:48:32 EDT Received: from PSUVM.PSU.EDU by CUVMB.CC.COLUMBIA.EDU (Mailer R2.03B) with BSMTP id 3204; Mon, 29 May 89 12:48:31 EDT Received: by PSUVM (Mailer R2.03B) id 7252; Mon, 29 May 89 12:44:05 EDT Date: Mon, 29 May 89 17:31:00 CET Reply-To: ASCII/EBCDIC character set related issues Sender: ASCII/EBCDIC character set related issues From: Johan van Wingen Subject: CP850 vs ISO 8859-1 To: Frank da Cruz Date: Mon, 1 May 89 16:15 CET From: "Johan van Wingen" To: "E. Hart" Subject: ISO equivalent of CP850

Dear List Subscribers

There is an issue not sufficiently discussed up to now. It has been proposed to replace CP850 by ISO 8859-1. I applaud this, because CP850 is a miserable misconstruct. But, as with CP437, it has 256 graphic characters, where ISO 8859-1 has only 190 (SPACE not included), with 65 positions reserved for control characters. Thus the two are not equivalent. Even if we prefer the more logical distribution of graphics over the code page of ISO 8859 to the chaos of CP850, we have not said anything about filling the four empty columns (0,1,8,9). Our attempt at having a 254 graphic set as an extension of ISO 8859-1 has failed for the moment. The question is what users want:

1. An additional 64 graphics
1.1 on the PC only
1.2 also under VM or MVS (what to do with the controls)
2. 64 controls only, either on PC or mainframe
3.
bytes interpreted as controls or as graphics, depending on "mode set" (this is more or less available with MS/DOS, but with CP437, CP850, and also with DEC VT340, not with 3270 terminals as far as I know)

The question is, if we want 64 extra graphics, which should we select? There is no guidance in any ISO standard. I am rather concerned about this, because I am thinking of presenting a new attempt at a 254 graphic set, and before showing it to anybody, I would welcome comments contributing to its content.

FROM J. W. van Wingen MOSGLA@HLERUL2.BITNET
Mail to P. O. Box 486, 2300AL Leiden, Netherlands