Developer Corner: Adso with Your Own C/C++ Application
Jan 4th, 2009 by trevelyan
This is the first in a series of posts for developers working on Chinese language processing tools, such as text-to-speech or bilingual search applications. Our goal is to show how Adso can make your own application smarter and smaller, and let you focus on your domain-specific problems instead of re-inventing the wheel by writing yet another Chinese language parser. This post focuses on integrating Adso with other C/C++ applications.
First things first, download Adso, unpack it and enter the /source subdirectory. Assuming you have a Unix system, you can compile and install the software as follows:
- ./prepare_internal
- make
- make install
This compiles the “internal” version of Adso, which includes our database in the actual binary. You probably want this version (rather than the MySQL or SQLite version) because it lets you compile your application statically, without any external dependencies. If you want to treat Adso as more than a black box, you can always use the MySQL or SQLite versions, which make it easier to manipulate database content in real time. Instructions for installing those versions are included in our distribution. Moving on: now that the software is installed, you should be able to test that it works:
./adso --help
We’re not interested in compiling the engine from scratch, though. We’re interested in building new and sexy applications that take advantage of Adso to do incredible things with YOUR software. So enter the /libsrc subdirectory and open the file “main.cpp”. You should see something like this:
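#include <iostream>
#include "../adsointerface.h"

int main(int argc, char **argv) {
    // Create the interface object once
    AdsoInterface *adsoInterface = new AdsoInterface();
    // Convert the Chinese text to pinyin and print the result
    std::cout << adsoInterface->pinyinize("Adso是一个自然语言分析系统") << std::endl;
    // Free the object when finished with it
    delete adsoInterface;
}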
If you compile and run this program (type “make” in the subdirectory), you’ll see that it prints out some segmented pinyin. How much easier could we make it? Just note a few things:
- We point the software to our header file “adsointerface.h”. The location of this file on your system will vary depending on where you unpacked Adso. Since we’re in a subdirectory of the project, we’re just referencing the version in the parent directory right now.
- Somewhere in our application we create a new AdsoInterface object.
- We pass our text to this object as a string and get our results back as a string. AdsoInterface supports three functions right now, each of which takes a string as input and returns a string as output: translate(std::string x), pinyinize(std::string x) and segment(std::string x).
- Remember to delete your AdsoInterface object when finished with it. You can reuse the object to process multiple pieces of text. Don’t create a separate AdsoInterface object each time you want to convert a snippet of text to pinyin or segment it.
Now that we know how to black-box things like pinyin conversion or translation, it’s important to point out that Adso gives you a tremendous amount of flexibility in how it processes Chinese text. Even though integrating Adso is easy, you can still customize the software heavily. You can even create your own arbitrary rulesets that selectively manipulate and process the text. We’ll get into the specifics of how to accomplish this in later tutorials.
For now, the best place to get started is to open the file adsointerface.cpp and look at what the code is doing. To give a very short example, though, let’s solve a specific problem: specifying input and output text encodings. By default Adso evaluates the text to determine its encoding and script, and returns content in whatever encoding and script you feed it. Sometimes you may want to specify an input or output encoding, though. Doing so also marginally speeds up the software, so it is useful to know in any case. These changes can be made by accessing the master Text, Parser, and Encoding objects held in the AdsoInterface object. To specify the input and output encodings and scripts, for instance, you can add something like this to your code after initialization:
- adsoInterface->encoding->input_encoding = 1;
- adsoInterface->encoding->input_script = 1;
- adsoInterface->encoding->output_encoding = 2;
- adsoInterface->encoding->output_script = 2;
This forces the software to treat incoming text as being in the gb2312/18030 encoding (1) and the simplified script (1), and to produce output in the traditional script (2) and the utf-8 encoding (2).
That’s enough for this post. In future tutorials we’ll talk about solving specific problems. If you are a developer of related applications and have problems or questions, please feel free to write to us. If you are a user and wish to help us in our efforts to promote this sort of resource, I’d encourage you to check out what we’re doing with Popup Chinese and consider joining our data generation efforts by using and helping to edit our online Chinese dictionary.