Chaski

Chaski is a distributed toolkit for machine translation. It contains the following tools:

  1. Distributed word clustering. It can build word classes for billion-word corpora.
  2. Distributed word alignment. Using the newest version of overview, it can train word alignment models on a cluster in hours instead of days.
  3. Distributed phrase extraction. Phrase extraction for a large corpus is slow and requires huge amounts of disk space or memory. Chaski extracts phrases at very high speed and stores intermediate files on HDFS to alleviate disk usage.

On Yahoo!'s M45 cluster, Chaski performed full training (starting from raw parallel data and producing a Moses-compatible phrase table and reordering table) on a corpus of 6 million sentence pairs in 8 hours. On a single machine, this usually takes one week 1).

Using Chaski is easy. A typical training run requires just two commands:

setup-chaski-full  Source-corpus  Target-corpus  HDFSRoot > chaski.config

will set up a configuration file that you can fine-tune, and then

train-full chaski.config

will run through the pipeline and get the phrase table ready.
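
The two steps above can be wrapped in a small shell script. This is only a minimal sketch: the corpus file names, the HDFS root, and the RUN dry-run switch are illustrative assumptions and not part of Chaski. Only the two command lines themselves come from this page.

```shell
#!/bin/sh
# Hypothetical placeholders; replace them with your own corpus files and HDFS directory.
SRC=${SRC:-corpus.src}
TGT=${TGT:-corpus.tgt}
HDFS_ROOT=${HDFS_ROOT:-/user/me/chaski}
CONFIG=${CONFIG:-chaski.config}

# By default only print the commands (dry run); set RUN=1 to execute them.
chaski_pipeline() {
    if [ "${RUN:-0}" = "1" ]; then
        # Step 1: generate the configuration file.
        setup-chaski-full "$SRC" "$TGT" "$HDFS_ROOT" > "$CONFIG" || return 1
        # (Fine-tune "$CONFIG" here if needed before training.)
        # Step 2: run the full training pipeline.
        train-full "$CONFIG"
    else
        echo "setup-chaski-full $SRC $TGT $HDFS_ROOT > $CONFIG"
        echo "train-full $CONFIG"
    fi
}

chaski_pipeline
```

Running the script as-is prints the two commands without touching the cluster. Running it with RUN=1 executes them, assuming both Chaski commands are on your PATH.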

Chaski is also flexible: the config file contains many options that you can adjust to maximize speed for your cluster's setup.

Currently Chaski is mainly developed on Yahoo!'s M45 cluster, and we would appreciate it if you could use and test it on other clusters. Bug reports are also appreciated!

Download

Please visit the download page for the most up-to-date release of Chaski.

Installation

Please see install for details about installing Chaski.

HOWTO

The tutorial provides a simple walkthrough of how to run Chaski and explains its configuration options.

If you encounter problems, please let me know and I will add them to the FAQ.

1) With MGIZA++ on a quad-core machine, you may be able to get it done in four or five days.
chaski/overview.txt · Last modified: 2009/11/28 11:37 by edwardgao