Kenneth R. Beesley
5 September 1997
Copyright © The Document Company - Xerox 1997. All Rights Reserved.
From 1984 to 1990, I worked for ALPNET (née ALPS), a company in Provo, Utah, USA that produced linguistic software of various kinds, including multi-lingual word processors, grammar checkers and, in particular, computer aids for natural-language translation and comprehension. My own activities included work on systems for French, Spanish and Portuguese, and the design and implementation of lingware for syntactic parsing and transfer.
In 1988, ALPNET rather naively decided to produce a comprehension aid for Arabic--naively because they assumed it was just another language like the European languages that had been attacked previously. The core component would be a morphological analyzer, and the lexicon would include short English "glosses" to be returned to the user as part of each solution. The goal was to produce an interactive system that would allow a monolingual English speaker to get the gist of an online Arabic text.
Initially I had absolutely nothing to do with the Arabic project. As was customary at ALPNET, a rather primitive cut-and-paste approach to the morphological analysis was anticipated; but while previous ALPNET systems had been hacked directly in C, the developers on the Arabic project had, to their credit, designed a new language in which they would describe the cut-and-paste operations in a higher-level rule format.
After our only Arabic speaker left the company, the Arabic project fell into my lap, and I knew that we were in trouble. I didn't speak any Arabic at all (and still don't), but I knew enough about the language to know that the root-and-pattern morphotactics and the absence of short vowels and other potentially helpful diacritical marks in the normal orthography presented the biggest morphological challenges that we had ever faced. And I had no confidence in the cut-and-paste approach, no matter how it was dressed up.
Then in August of 1988, a number of unexpected developments made the Arabic project possible. First, I was astounded when the always impecunious ALPNET suddenly decided to send me and my colleague Deryle Lonsdale to the COLING conference being held in Budapest, not only for the conference itself but for an extra week of seminars. I had not expected to go, and I had no idea what the conference might offer. Second, on arriving in Budapest, I found that seminars were being offered by distinguished researchers from the Xerox Palo Alto Research Center (PARC), and I signed up for a Common-Lisp seminar by Martin Kay and a seminar in Morphology by Lauri Karttunen.
Karttunen's seminar constituted my first exposure to finite-state morphology in the extremely clever but limited implementation known as Two-Level Morphology. Presented first in Kimmo Koskenniemi's 1983 dissertation, Two-Level Morphology had been popularized mostly by Karttunen and his students at the University of Texas. Fighting persistent jetlag and the summer heat of Budapest, I managed to stay awake long enough to grasp that Two-Level Morphology might just help me do Arabic, if only I could overcome one of its principal shortcomings, the limitation to concatenative morphotactics.
After returning to Utah, I acquired a copy of Koskenniemi's dissertation, Karttunen's University of Texas papers, and a Common-Lisp system for my Mac SE, at that time a respectable machine. And then I virtually disappeared for two months to educate myself; I literally read Koskenniemi's dissertation three times. There was unfortunately no implementation of Two-Level Morphology available to me in 1988; the excellent PC-KIMMO system and book by Evan Antworth of the Summer Institute of Linguistics would not appear until 1990. Xerox PARC had an Interlisp implementation, including an automatic rule compiler, but all attempts to license it for commercial purposes were in vain. But following the descriptions in the Karttunen papers, I reimplemented a Two-Level Morphology framework from scratch in Common Lisp (my first Lisp program), and I taught myself to hand-compile two-level rules into transducers.
The biggest challenge was to handle Arabic's root-and-pattern morphotactics, which is simply not anticipated in classical Two-Level Morphology, where the only morphotactic process is concatenation. While the formal analysis of Arabic morphology is contentiously theory-dependent in its details, there is rough agreement that stems in Arabic consist of a consonantal "root", typically of three consonants like drs or ktb, which "interdigitates" with patterns like CaCaC and CiCa:C to form stems like daras, katab, dira:s and kita:b. Perhaps 5000 roots are commonly used in modern Arabic, and the language provides about 400 phonologically distinct patterns, each root combining with an idiosyncratic and often very small subset of the patterns. The interdigitated stems are then made into complete words with prefixes and suffixes that concatenate with the stems in the usual boring way.
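To make the operation concrete, here is a minimal sketch of interdigitation in Common Lisp, the language of the original prototype. Representing roots and patterns as plain strings is my own simplification, not the ALPNET encoding; the real system of course worked over finite-state machinery rather than strings.

    ;; Minimal sketch: fill each C slot of a pattern with the next
    ;; root consonant; all other pattern characters pass through.
    (defun interdigitate (root pattern)
      "Interdigitate ROOT with PATTERN to form a stem."
      (let ((consonants (coerce root 'list)))
        (coerce (loop for ch across pattern
                      collect (if (char= ch #\C) (pop consonants) ch))
                'string)))

    ;; (interdigitate "drs" "CaCaC")  => "daras"
    ;; (interdigitate "ktb" "CiCa:C") => "kita:b"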
The solution I devised for Arabic in Two-Level Morphology was to place roots and patterns in separate sublexicons, and to write Lisp code that performed the interdigitation of stems at runtime. This operation, dubbed Detouring, simulated the formal finite-state operation of intersection. The restrictions governing which roots could combine with which patterns (and other dependencies within words) were imposed via an auxiliary feature-unification mechanism. On these bases, and using data gleaned from McCarthy's influential 1981 paper, I emerged from my two months of Two-Level Immersion with a working and stable prototype, small of course, but thenceforth expandable via appropriate additions to the rules and lexicons. Armed with this prototype, I convinced the bosses to abandon all the previous cut-and-paste work and let me continue in the two-level framework.
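The following toy continuation of the sketch above, reusing interdigitate, suggests how roots, each listing its idiosyncratic subset of permitted patterns, can be kept in a separate sublexicon and combined at runtime. The lexicon entries are invented for illustration; the actual Detouring and feature-unification mechanisms were far more elaborate.

    ;; Toy sublexicon: each root lists the patterns it accepts.
    ;; All entries are invented for illustration.
    (defparameter *root-lexicon*
      '(("ktb" "CaCaC" "CiCa:C")
        ("drs" "CaCaC")))

    (defun stems-for-root (root)
      "Combine ROOT with each of its permitted patterns."
      (mapcar (lambda (pattern) (interdigitate root pattern))
              (cdr (assoc root *root-lexicon* :test #'string=))))

    ;; (stems-for-root "ktb") => ("katab" "kita:b")
    ;; (stems-for-root "drs") => ("daras")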
At that point, I absolutely needed a real Arabist and lexicographer. Professor Dilworth Parkinson of Brigham Young University recommended Tim Buckwalter, then at Indiana University pursuing a Ph.D. with a dissertation on Arabic lexicography. It turned out that Buckwalter had lived seven years in the Middle East, had literally read through the complete Hans Wehr dictionary (comparing the 3rd and 4th editions word by word), and had as his great goal in life to be an Arabic lexicographer. His knowledge of the Arabic language and orthography extended down to the last jot and tittle. This was my man, and I was lucky enough to convince him to join ALPNET and move his family to Provo.
Luckily Tim also turned out to be a great friend, for we were destined to work together closely for over a year. Tim eventually took over the dictionaries completely, and he supplied me with hundreds of typed-out two-level examples that I analyzed to formulate the two-level rules, which I then compiled into transducers by hand.
Tim also supervised two part-time lexicographers, Osama Shabaneh and Derek Foxley, and did all the testing, while I handled the auxiliary C programming to perform automated testing, sort solutions, present the English glosses, etc. We were also ably assisted by Stuart Newton, who did all the systems programming necessary to implement the final system as a virtual machine running on an IBM mainframe. Stuart also gamely tried to write an automatic two-level rule compiler, and it was only years later, while helping Lauri Karttunen rewrite the Xerox rule compiler, that I realized how unrealistic that effort had been.
We continued to develop Arabic in the Common-Lisp implementation for about six months, until we had developed a much faster version in C on the IBM mainframe. (To acquaint Stuart Newton with two-level theory, we also used the Common-Lisp implementation to develop a significant prototype of a morphological analyzer for Aymará, an Amerind language in which Newton is proficient.) Work continued throughout 1989 and into 1990, when the project ended.
During this period, we presented three papers on the Arabic system at conferences. The first,
Beesley, K.R., "Computer Analysis of Arabic Morphology: A Two-Level Approach with Detours," in the Proceedings of the Third Annual Symposium on Arabic Linguistics, University of Utah, Salt Lake City, Utah, 3-4 March 1989.
described the prototype system, based on an enhanced version of two-level morphology implemented in Common Lisp. The second,
Beesley, K.R., T. Buckwalter and S.N. Newton, "Two-Level Finite-State Analysis of Arabic Morphology," in the Proceedings of the Seminar on Bilingual Computing in Arabic and English, Cambridge, England, 6-7 Sept 1989. No pagination.
again described the prototype, going into some detail on the feature-unification mechanism. The third and most important paper,
Beesley, K.R., "Finite-State Description of Arabic Morphology," in the Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English, 5-7 Sept 1990. No pagination.
described the final C-language implementation. These two later papers, appearing only in the proceedings of the Cambridge conferences, have remained fairly obscure. The very first paper, on the other hand, was eventually reprinted in 1991 in
Bernard Comrie and Mushira Eid (eds.), Perspectives on Arabic Linguistics III: Papers from the Third Annual Symposium on Arabic Linguistics, Amsterdam: John Benjamins, 1991. pp. 155-172.
and thus became the most accessible and best-known description. This has unfortunately left the impression with some reviewers that the ALPNET Arabic project never got out of the prototype stage.
At the end of the Arabic project, ALPNET was changing its business, moving out of software development and into traditional translation services. I left the company to join Microlytics, a Xerox spinoff headquartered in Rochester, New York, and moved to Silicon Valley as the Microlytics liaison with Xerox PARC. I got this job, no doubt, because of a positive recommendation from Lauri Karttunen, with whom I had kept in contact.
Over the past six years, the fate of the ALPNET Arabic system has often been uncertain. It was sold at least once, though ALPNET still retains the commercial rights. Several reports indicate that a copy of the system was acquired by the U.S. Army Research Institute and that it is being used in an Arabic-teaching system under development at the University of Maryland. ALPNET also made some unsuccessful attempts to license the dictionaries to other software developers.
Bit-rot inevitably set in. Despite my best attempts to archive the system securely when I left ALPNET, placing all the documentation and backup tapes in a sealed box covered with curses and pleas, it was almost lost a couple of times. Someone removed and lost the documentation around 1992, but it was eventually rediscovered on a bookshelf in the ALPNET office. When I approached ALPNET in 1995, hoping to license the system for use at Xerox, the box was eventually found in a storage shed, but the backup tape had been removed; whether lost, erased or stolen, it has never been recovered. What remained were hardcopy printouts of the two-level rules (not the final versions), Tim Buckwalter's work-at-home PC diskette copies of most of the lexicons (also not the final versions), and hardcopies of the documentation and examples.
The surviving components were licensed from ALPNET in late 1995, and the materials came into my possession at the Xerox Research Centre Europe in Grenoble, France in January of 1996. With lots of long-distance email consulting from Tim Buckwalter, we managed to fill in the gaps, and I completely redesigned and rebuilt the system using Xerox Finite-State Technology. The rebuild was reported in a paper read at COLING 96.
Beesley, K.R., "Arabic Finite-State Morphological Analysis and Generation" in COLING-96 (The 16th International Conference on Computational Linguistics), Proceedings, vol. 1 (5-9 August, 1996), Copenhagen:Center for Sprogteknoligi, pp. 89-94.
The original ALPNET system, like most two-level morphology systems, was much more suitable for analysis than for generation. Arabic is especially problematic in this respect because the optionality of diacritical markings makes it desirable to analyze all possible spellings of every word, but to generate only a single "fully voweled" form. Between August 1996 and August 1997, I rewrote the rules extensively to more reliably generate fully voweled forms as part of each solution. In addition, while it is theoretically possible to write a finite-state morphological analyzer using only two levels, in practice, at least for Arabic, it is often a significant nuisance, resulting in less flexibility, less perspicuity, and greater difficulty in testing. Some of the nastiest rules in the original ALPNET system, including the notorious hamza-realization rules, were considerably simplified in the new system by breaking the mapping up into a few more levels.
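As a concrete illustration of the asymmetry (my own toy code, not the Xerox rules): an analyzer must accept any spelling obtainable from the fully voweled form by dropping short vowels, while a generator should emit only the fully voweled form itself.

    ;; Toy predicate: SURFACE matches VOWELED if it can be obtained
    ;; by deleting any subset of the short vowels a, i, u.  The real
    ;; system expressed such optionality with finite-state rules.
    (defun matches-unvoweled-p (surface voweled)
      (cond ((zerop (length voweled)) (zerop (length surface)))
            ((and (plusp (length surface))
                  (char= (char surface 0) (char voweled 0)))
             (matches-unvoweled-p (subseq surface 1) (subseq voweled 1)))
            ((member (char voweled 0) '(#\a #\i #\u))
             (matches-unvoweled-p surface (subseq voweled 1)))
            (t nil)))

    ;; (matches-unvoweled-p "ktb"  "katab") => T
    ;; (matches-unvoweled-p "ktab" "katab") => T
    ;; (matches-unvoweled-p "ktib" "katab") => NIL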
Another problem of the original system, even as rebuilt in 1996, was that input and output were done using a Roman transliteration (Buckwalter Transliteration). Although this was a carefully designed "strict" transliteration, with an unambiguous and reversible one-to-one mapping between the transliteration symbols and encodings like Unicode, it was simply not convincing to many naive observers. Most non-linguists cannot distinguish between the language itself and the orthography used to represent the language; in short, if they couldn't see the Arabic squiggles on the screen, they assumed you weren't analyzing real Arabic at all.
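A fragment of such a one-to-one mapping shows why reversibility is trivial. The pairs below follow the widely circulated Buckwalter scheme, and the helper functions are my own illustration, not the exact ALPNET table.

    ;; A few Buckwalter-style pairs only; the full table covers every
    ;; Arabic letter and diacritic.  Because the map is a bijection,
    ;; the inverse mapping is well-defined.
    (defparameter *translit-table*
      (list (cons #\A (code-char #x0627))    ; alif
            (cons #\b (code-char #x0628))    ; beh
            (cons #\t (code-char #x062A))    ; teh
            (cons #\k (code-char #x0643))    ; kaf
            (cons #\a (code-char #x064E))    ; fatha (short vowel)
            (cons #\u (code-char #x064F))))  ; damma (short vowel)

    (defun translit->arabic (s)
      (map 'string (lambda (c) (or (cdr (assoc c *translit-table*)) c)) s))

    (defun arabic->translit (s)
      (map 'string (lambda (c) (or (car (rassoc c *translit-table*)) c)) s))

    ;; Round trip: (arabic->translit (translit->arabic "katab")) => "katab"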
Realizing that there was a psychological and esthetic problem with using a Roman transliteration, no matter how formally legitimate it might be, I wrote Java applets in 1997 that allow users to interact with the Xerox Arabic system using real Arabic orthography from any Java-enabled web browser.
Although Tim Buckwalter has contributed crucially as the consultant, via email, on the rebirth of the Xerox Arabic Morphology system, his other activities (a normal job and additional Arabic consulting, including the editing of printed dictionaries) have precluded his spending as much time as we would both have liked on testing and responding to my endless questions. And unfortunately, we currently have no one at the Xerox Research Centre Europe who is qualified and available to do such testing. Rather than sit on the system any longer, we decided in 1997 to make it available on the Internet, both as a service and as an experiment, hoping that user feedback would help us identify and correct the remaining problems. Once a problem is clearly identified, it is usually easy to fix.