Welcome to Qin Gao's software page. I hope you find something useful here.


2012/6/9 A new MGiza version is online; it now auto-detects the number of cores in the system. It can be downloaded here. Please read the release notes.

2012/5/26 A new MGiza version is online, adding reduced-memory support for snt2cooc. It can be downloaded here. Please read the release notes.

2011/12/30 A new MGiza version is online, adding native Windows support (Win32; partial for Win64). It can be downloaded here. Please read the release notes.

2011/12/29 Uploaded new force alignment scripts that are compatible with Moses. They can be downloaded here.

2010/05/10 Updated instructions for force alignment, thanks to Arek.

2010/03/08 Bug fix for Chaski Download

2010/01/23 Releases of Chaski and MGIZA will be hosted on SourceForge

Chaski on Sf

2010/01/11 Maintenance release of Chaski (0.2.3) and MGIZA (0.6.3)

Important bug fix for MGIZA. If you encounter a segmentation fault during Model 3 training, please use the latest version (0.6.3).

2009/12/07 Maintenance release of Chaski (0.2.2) and MGIZA (0.6.2)

2009/11/27 Maintenance release of Chaski (0.2.1)

Release Notes

2009/11/24 Configuration documentation for MGIZA++

An (almost) complete list of MGIZA++ configuration options is now documented online: MGIZA++ Configuration

2009/11/11 New version of Chaski!

I am glad to release the new version of Chaski, which greatly extends the functionality of the package. PGIZA is now integrated into Chaski, along with a new distributed word clustering tool, which means you can start from a raw corpus and build the complete phrase table and lexicalized reordering model compatible with Moses. Instead of waiting weeks on a single machine, full training on a corpus of 6 million sentence pairs now takes half a day on a Hadoop cluster.

Please see the overview for details on how to download and install the new version. Suggestions and bug reports are appreciated.

Hadoop related

Chaski : A software package for training phrase-based machine translation systems on Hadoop clusters. Together with MGIZA, it can train large-scale models in hours.

HadoopDaemon : A simple interface to help you run ANY program using Hadoop. That is, it makes Hadoop behave more like Condor or Maui/Torque, which may appear to be a bad idea… But sometimes you need it, because going through the MapReduce framework may just screw up your files. (And you may have no other choice, since Hadoop is the only way to run your job…)

Word Alignment

MGIZA++ : Multi-threaded GIZA++. It is an extended and optimized version of GIZA++ that can run multi-threaded and provides additional functionality and optimizations, such as:

  • Resuming training from previous models. You may restart training from any step, given a previous model.
  • Memory usage optimization. Duplicated tables in memory are eliminated, which may save hundreds of megabytes of memory. This is crucial for distributed alignment.
  • Integration with Chaski. This version is fully integrated with Chaski and can therefore be run on Hadoop clusters. (Currently it only works with Hadoop 0.20.1+.)
start.txt · Last modified: 2012/06/08 22:09 by edwardgao
