
This is the first in a series of posts intended for developers working on Chinese language processing tools, such as text-to-speech or bilingual search applications. Our goal is to demonstrate how to use Adso to make your own application smarter and smaller, and to let you focus on solving your domain-specific issues instead of re-inventing the wheel by writing another Chinese language parser. In this post we’re focusing on integrating Adso with other C/C++ applications.

First things first, download Adso, unpack it and enter the /source subdirectory. Assuming you have a Unix system, you can compile and install the software as follows:

  1. ./prepare_internal
  2. make
  3. make install

This compiles the “internal” version of Adso, which includes our database in the actual binary. The reason you probably want to do this (instead of compiling the MySQL or SQLite version) is that it lets you compile your application statically without any external dependencies. If you want to treat Adso as anything except a black box, you can always use the MySQL or SQLite versions, which make it easier to manipulate database content in real time. Instructions for installing those versions are included in our distribution. Moving on: now that the software is installed, you should be able to test that it works:

./adso --help

We’re not interested in compiling the engine from scratch though; we’re interested in building new and sexy applications that take advantage of Adso to do incredible things with YOUR software. So enter the /libsrc subdirectory and open the file “main.cpp”. You should see something like this:

#include <iostream>
#include "../adsointerface.h"

int main(int argc, char **argv) {

    AdsoInterface *adsoInterface = new AdsoInterface();

    std::cout << adsoInterface->pinyinize("Adso是一个自然语言分析系统") << std::endl;

    delete adsoInterface;

}

If you compile and run this software (type “make” in the subdirectory) you’ll see that it prints out some segmented pinyin. How much easier can we make it? Just note a few things:

  • We point the software to our header file “adsointerface.h”. The location of this file on your system will vary depending on where you unpackaged Adso. Since we’re in a subdirectory of the project we’re just referencing the version in the parent directory right now.
  • Somewhere in our application we create a new AdsoInterface object.
  • We pass our text to this object in the form of a string, and get our results returned in the form of a string. AdsoInterface supports three functions right now, each of which takes a string as input and returns a string as output: translate(std::string x), pinyinize(std::string x) and segment(std::string x).
  • Remember to delete your AdsoInterface object when finished with it. You can reuse the object to process multiple pieces of text. Don’t create a separate AdsoInterface object each time you want to convert a snippet of text to pinyin or segment it.
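The reuse advice in the last point can be sketched roughly as follows. This is only an illustration: the AdsoInterface struct here is a hypothetical stand-in so the snippet compiles without Adso installed, and pinyinize_all is a helper name invented for this example. In a real build you would delete the stand-in and include “adsointerface.h” instead.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in so this sketch compiles on its own. It only mimics
// the three string-in, string-out functions named above; the real class
// comes from "adsointerface.h" and returns actual analysis results.
struct AdsoInterface {
    std::string pinyinize(std::string x) { return "[pinyin] " + x; }
    std::string segment(std::string x)   { return "[segment] " + x; }
    std::string translate(std::string x) { return "[translate] " + x; }
};

// Invented helper: converts every snippet with the ONE engine object that
// the caller created, rather than constructing a new object per snippet.
std::vector<std::string> pinyinize_all(AdsoInterface &adso,
                                       const std::vector<std::string> &snippets) {
    std::vector<std::string> results;
    results.reserve(snippets.size());
    for (const std::string &snippet : snippets) {
        results.push_back(adso.pinyinize(snippet));
    }
    return results;
}
```

The caller creates the AdsoInterface object once, passes it by reference to every call, and deletes it (or lets it go out of scope) only when all the text has been processed.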

Now that we know how to black-box things like pinyin conversion or translation, it’s important to point out that Adso affords you a tremendous amount of flexibility in the way it processes Chinese text. So even though it is really easy to integrate Adso, it is possible to really customize the software. You can even create your own highly arbitrary rulesets which selectively manipulate and process the text. We’re going to get into specifics of how to accomplish this in some later tutorials.

For now the best place to get started is actually to open the file adsointerface.cpp and look at what the code is doing. To give a very short example though, let’s solve a specific problem: specifying input and output text encodings. By default Adso will evaluate the text to determine its encoding and script, and return content in whatever encoding and script you feed it. Sometimes you may want to specify an input or output encoding though. This also marginally speeds up the software, so it is useful to know in any case. Changes like this can be made by accessing the master Text, Parser, and Encoding objects held in the AdsoInterface object. To specify the input and output encodings and scripts, for instance, you can add something like this to your code after initialization:

  • adsoInterface->encoding->input_encoding = 1;
  • adsoInterface->encoding->input_script = 1;
  • adsoInterface->encoding->output_encoding = 2;
  • adsoInterface->encoding->output_script = 2;

This forces the software to treat incoming text as the GB2312/GB18030 encoding (1) in the simplified script (1), and to produce output text in the traditional script (2) in the UTF-8 encoding (2).
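If the bare numbers feel opaque, one option is to give them names in your own code. The sketch below assumes only the four values described above. The constant names, the helper function and the stand-in Encoding and AdsoInterface structs are all invented for illustration, and the stand-in defaults of 0 are placeholders rather than Adso’s real defaults. In a real build you would use the types from “adsointerface.h”.

```cpp
// Values as described in this post: encoding 1 = GB2312/GB18030, 2 = UTF-8;
// script 1 = simplified, 2 = traditional. The names are made up here.
constexpr int ENCODING_GB = 1;
constexpr int ENCODING_UTF8 = 2;
constexpr int SCRIPT_SIMPLIFIED = 1;
constexpr int SCRIPT_TRADITIONAL = 2;

// Hypothetical stand-ins so the sketch compiles without Adso installed.
// Only the four fields used in this post are modelled.
struct Encoding {
    int input_encoding = 0;   // placeholder default, not Adso's
    int input_script = 0;
    int output_encoding = 0;
    int output_script = 0;
};

struct AdsoInterface {
    Encoding settings;
    Encoding *encoding = &settings;  // pointer, to match the -> access above
    AdsoInterface() = default;
    AdsoInterface(const AdsoInterface &) = delete;             // copying would
    AdsoInterface &operator=(const AdsoInterface &) = delete;  // leave a stale pointer
};

// Invented helper: read simplified GB text, write traditional UTF-8 text.
void force_gb_simplified_to_utf8_traditional(AdsoInterface &adso) {
    adso.encoding->input_encoding = ENCODING_GB;
    adso.encoding->input_script = SCRIPT_SIMPLIFIED;
    adso.encoding->output_encoding = ENCODING_UTF8;
    adso.encoding->output_script = SCRIPT_TRADITIONAL;
}
```

Calling the helper once, right after creating the AdsoInterface object, keeps the magic numbers in a single place.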

Enough for this post. In future tutorials we are going to talk about solving specific problems. If you are a developer of related applications and have problems or questions, please feel free to write us. If you are a user and wish to help us in our efforts to promote this sort of resource, I’d encourage you to check out what we’re doing with Popup Chinese and consider joining our data generation efforts by using and helping to edit our online Chinese dictionary.

If you’ve subscribed to this blog you probably already know that Adso is an open source engine providing Chinese text segmentation, hanzi-to-pinyin conversion and Chinese-English machine-translation services. If you’ve just stumbled here through Google you can get a feel for how the software works by visiting our online demo at Popup Chinese. This post marks the release of our latest version of the software, which is available for download here.

You can use this version free of charge for machine translation, text segmentation and hanzi-to-pinyin conversion, as long as you give clear and prominent attribution of our work. If you wish to use the software for other commercial purposes you still have to ask nicely, but we are nice people and will probably say yes as long as your usage helps the community. Developers are invited but not required to share changes with us. And edits to the backend dictionary continue to be welcome through our online Chinese English dictionary. If you get into development please contact us so we can enable you as an approved dictionary editor on the site. This will speed up the review process.

In other recent changes, installing Adso from source will now also produce a standalone system library (libadso.so of course) that can be easily used by third-party applications to offload Chinese text analysis to the Adso engine. This makes it very easy to build dictionary reference, translation, and even text analysis programs or incorporate these features in other applications without actually needing to code them from scratch. If you’re interested in this sort of thing, the easiest way to get started is to download Adso and start hacking. I’ve put a sample program in our /source/libsrc directory that demonstrates how to do this.

In the next few days and weeks, I’ll be publishing a few short and technical guides on how to integrate Adso with other applications in order to accomplish very specific tasks. In my next post, I’ll put up a quick guide to using our new AdsoInterface class to interface with the software through a standalone C/C++ application. In the meantime, everyone is welcome to download the latest release and try it out for themselves. Feedback is always very welcome, and changes are encouraged through our sexy, user-editable and free online Chinese dictionary.

Happy New Year everyone!

 

If you’re considering taking the HSK and want a way to prepare for the exam without needing to be online (or if you want as much HSK goodness as you can fit into a small binary package), be sure to check out our Popup Chinese HSK test preparation software.

 

As mentioned in our release note on the site, this is a Windows application I’ve just created that will shuffle you through a gauntlet of HSK questions. If you don’t have Chinese friends or family who will objectively and routinely tell you about your failings in their language, this is a great substitute. It’s an effective way to find out about your problems (and fix them!) before you get called to the Great Hall of the People to accept the gratitude of the nation for being such an awesome individual and helping support Popup Chinese.

 

And yes… that is actually relevant. As we mention on the site, you will need to have either a current paid subscription, or have paid us sometime in the past, to register the application and download the hundreds of additional test questions we’ve got (with more coming each week) that make the application so very useful. Even if you don’t upgrade, I’d encourage you to check it out though. Comments and feedback are appreciated and we’ll work to make it even better.

Dajiudian.info has apparently launched. The news slipped into my RSS feed yesterday, so I wanted to post a quick note for those who haven’t run into them yet. The interface is really spartan and could use work on the usability front, but the idea itself is very solid: a mashup between Google Maps and a hotel booking service.

I checked out the results for Beijing and was impressed enough to send a note. The nice thing is that they seem to be covering low-budget hotels in addition to the reams of luxury hotels that exist under the hopeful delusion that most foreigners in China are simply passing through to buy expensive jewelry on their way to Bali. Maybe I’m just terminally cheap, but it’s nice to be able to pull up a map and quickly see where the more inexpensive hotels in any city are. I also thought it was cool that they had some budget hotels in smaller cities like Baotou, Inner Mongolia.

Consider yourselves all informed. In the past I’ve always just found a place by walking around until I stumble into something suitable. This is usually the cheapest way to get a hotel room, but it doesn’t work in some places (*cough* Shanghai). The next time I pass through at least *that* city, I will probably use it.
