ManManLai Development Process

31 Jul 2011

When we started developing ManManLai Chinese, we set out to create a single application for both elementary and advanced Chinese language learners. We envisioned an app that would stay with learners throughout their studies. As their skills and fluency improved, the app would provide more advanced vocabulary and capabilities.

We also envisioned an app built on a fully cross-referenced internal information model, making it easy for a user to explore the world of Chinese in all of its dimensions.

What do we mean by dimensions? First, let’s consider some basic concepts.

The fundamental unit of information in Chinese is the character, and two or more characters are often combined into compounds representing words, phrases and idioms. But Chinese characters are not like letters in an alphabetic language. A character stands on its own as a meaningful piece of information, and so students must eventually master not only thousands of compounds, but also thousands of individual characters comprising those compounds.

The challenge of modeling characters and compounds in an information architecture may be fairly tractable in itself, but some other important considerations complicate the problem:

  • Characters can have more than one pronunciation.
  • Pronunciation is represented in Western alphabets using the standard Hanyu Pinyin romanization scheme.
  • Many characters and compounds share the same pronunciation.
  • Characters can be organized by internal structures called radicals.
  • Characters can be organized by the number of strokes needed to draw them.
  • Some characters and compounds are rarely used, and some are frequently used.
  • Mainland China uses simplified characters, while Taiwan and Hong Kong use traditional ones.
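
To make these considerations concrete, here is a minimal sketch of how a single character record might carry them. It is not ManManLai's actual data model: the class, field, and method names are hypothetical, pinyin is written with tone numbers for simplicity, and the frequency rank is a placeholder value. The sample is 乐 (traditional 樂), which is read either lè or yuè. The code assumes Java 16 or later for records.

```java
import java.util.List;

// Hypothetical model; names and fields are illustrative assumptions only.
class CharacterModel {

    /** One character with the attributes from the list above. */
    record Hanzi(String glyph,            // simplified form
                 String traditionalForm,  // same as glyph when no variant exists
                 List<String> pinyin,     // one entry per pronunciation
                 String radical,
                 int strokeCount,
                 int frequencyRank) {     // 0 = unknown (placeholder)

        boolean hasMultipleReadings() {
            return pinyin.size() > 1;
        }

        boolean hasTraditionalVariant() {
            return !glyph.equals(traditionalForm);
        }
    }

    /** Sample record for U+4E50 (simplified) / U+6A02 (traditional). */
    static Hanzi sampleLe() {
        return new Hanzi("\u4E50", "\u6A02",
                List.of("le4", "yue4"), // two readings of the same glyph
                "\u4E3F",               // radical U+4E3F
                5,                      // strokes in the simplified form
                0);                     // frequency rank not modeled here
    }
}
```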

Apart from simple dictionary look-up mechanisms, we envisioned an app that makes it easy to discern relationships and patterns: What does this character mean in this compound? What other compounds use this character? What other characters or compounds share the same pronunciation? Those are the kinds of questions one can answer with ManManLai.
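
Questions like these map naturally onto inverted indexes. The following is only a sketch of that idea, not the structure ManManLai uses: the class and method names are hypothetical, and pinyin is again written with tone numbers. One map goes from a character to the compounds containing it, and another from a pronunciation to every entry that shares it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical cross-reference index; an illustration, not the app's schema.
class CrossReferenceIndex {
    // Unicode code point of a character -> compounds that contain it
    private final Map<Integer, List<String>> compoundsByCharacter = new TreeMap<>();
    // Pinyin string -> headwords pronounced that way
    private final Map<String, List<String>> entriesByPinyin = new TreeMap<>();

    /** Registers a headword (single character or compound) with its pinyin. */
    void add(String headword, String pinyin) {
        entriesByPinyin.computeIfAbsent(pinyin, k -> new ArrayList<>()).add(headword);
        boolean isCompound = headword.codePointCount(0, headword.length()) > 1;
        if (isCompound) {
            // Index the compound once under each distinct character it contains.
            headword.codePoints().distinct().forEach(cp ->
                    compoundsByCharacter.computeIfAbsent(cp, k -> new ArrayList<>())
                                        .add(headword));
        }
    }

    /** "What other compounds use this character?" */
    List<String> compoundsUsing(String character) {
        return compoundsByCharacter.getOrDefault(character.codePointAt(0), List.of());
    }

    /** "What other characters or compounds share this pronunciation?" */
    List<String> entriesPronounced(String pinyin) {
        return entriesByPinyin.getOrDefault(pinyin, List.of());
    }
}
```

For example, after adding 中国 (zhong1 guo2), 中文 (zhong1 wen2), 中 (zhong1), and 钟 (zhong1), `compoundsUsing("中")` returns the two compounds, and `entriesPronounced("zhong1")` returns 中 and 钟.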

Given all of these considerations, it became clear that a completely unified Chinese-English corpus, sufficiently indexed and internally cross-referenced, would be required to expose the language in all of its dimensions.

The solution was further complicated by the need to synthesize numerous open-source and proprietary data sources into one whole. Each data source had its own organization and formatting, dirty and incomplete entries, and internal inconsistencies. In some cases, the organization was ad hoc or only implicitly documented. We therefore needed a system that could manage these sources and allow for rapid, iterative evolution.

An extensible framework written in Java, with a smattering of Clojure, was built to achieve these goals. The codebase is covered by JUnit tests. It uses Natural Language Processing (NLP) techniques such as tagging, along with pattern matching, to generate metadata describing the entries in the corpus and to construct a consistent representation from the heterogeneous data sources.
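
As an illustration of the pattern-matching step, the sketch below normalizes one line of a dictionary file into a structured entry. The line layout assumed here (traditional form, simplified form, bracketed pinyin, slash-separated glosses) follows the open-source CC-CEDICT format; the original text does not say which sources ManManLai uses, so treat both the format and the names `EntryParser`, `Entry`, and `parse` as assumptions. Records require Java 16 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for lines shaped like:
//   TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
class EntryParser {

    /** Uniform representation produced from one source line. */
    record Entry(String traditional, String simplified,
                 String pinyin, List<String> glosses) {}

    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+/(.+)/$");

    /** Returns the parsed entry, or empty for comments and malformed lines. */
    static Optional<Entry> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty(); // dirty or incomplete entry: skip it
        }
        List<String> glosses = Arrays.asList(m.group(4).split("/"));
        return Optional.of(new Entry(m.group(1), m.group(2), m.group(3), glosses));
    }
}
```

Returning an empty `Optional` rather than throwing lets a caller count and log the rejected lines, which is one way to keep track of dirty entries while a source is being cleaned up.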

The framework represents the nucleus of the system, and it’s curious how so much of ManManLai’s development revolved around an outside code base only indirectly related to Objective-C, Cocoa, and iOS.

In fact, the Java framework also generates the SQLite database schema built into the iPhone application. In some cases, to address performance and resource constraints on the phone, Objective-C code is generated by the Java framework and compiled directly into the product.
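
Schema generation of this kind can be as simple as emitting DDL text from a description of the columns. The sketch below shows the general idea only; the table name, column names, and the `SchemaGenerator` class are hypothetical and do not reflect ManManLai's real schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical generator that emits SQLite DDL from a column description.
class SchemaGenerator {

    /** Builds a CREATE TABLE statement; map order determines column order. */
    static String createTable(String table, Map<String, String> columns) {
        String body = columns.entrySet().stream()
                .map(e -> "  " + e.getKey() + " " + e.getValue())
                .collect(Collectors.joining(",\n"));
        return "CREATE TABLE " + table + " (\n" + body + "\n);";
    }

    /** Example table for single characters (illustrative columns only). */
    static String hanziTableDdl() {
        Map<String, String> columns = new LinkedHashMap<>(); // keeps insertion order
        columns.put("id", "INTEGER PRIMARY KEY");
        columns.put("glyph", "TEXT NOT NULL");
        columns.put("stroke_count", "INTEGER");
        return createTable("hanzi", columns);
    }
}
```

Because the DDL is produced by the same code that knows the corpus structure, the database shipped in the app and the data loaded into it cannot drift apart between builds.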

Underpinning this system is a practical, documented, and entirely reproducible build process driven by UNIX shell and Ant scripts and integrated into the source control management system. It even drives Xcode. The process fully supports iterative upgrades to the data sources and development of new features, while maintaining referential integrity within the corpus and derivative user-generated data files, from version to version of the app.

These pragmatic and iterative approaches proved especially well suited to an unfamiliar and complicated problem domain. Without the infrastructure in place to manage two code bases and numerous, changing data sources, we would not have been able to move forward, adding and changing features while ensuring a high-quality product.

You may employ the same engineering discipline by hiring MSD Services for your project. Contact MSD Services today.