MGIZA

MGIZA++ is a multi-threaded word alignment tool based on GIZA++. It extends GIZA++ in multiple ways:

Multi-threading

MGIZA++ can make efficient use of multi-core platforms. A quad-core machine can typically achieve a three-fold speedup over single-threaded GIZA++.

Memory optimization

By eliminating duplicated tables, MGIZA++ can save a lot of memory compared to GIZA++.

Resume training

MGIZA++ can resume training from any stage. For example, you may be able to reuse previously available models and continue training directly from IBM Model 4 instead of starting all the way from Model 1.

Integrated with Chaski

MGIZA++ can be integrated into Chaski and run on clusters, which will give you an even larger speedup.

Native Windows support

MGIZA++ can now be compiled in Visual Studio, providing native MS Windows support. The latest version is, however, not stable when compiled as 64-bit.

If MGIZA++ helps you, please cite the following paper in addition to the GIZA++ one:

Qin Gao, Stephan Vogel, “Parallel Implementations of Word Alignment Tool”, Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June 2008

Download

The latest version of MGIZA++ can be downloaded here:

| Version | Date | Link | Comment | Release Note |
| 0.7.3 | 2013-01-19 | Download | Several fixes for compilation issues | Release Note Fixing Boost Library, From Amittai |
| 0.7.2 | 2012-05-26 | Download | A functional upgrade: mgiza can now automatically set the number of threads. | Release Note Fixing Boost Library, From Amittai |
| 0.7.1 | 2012-05-26 | Download | A functional upgrade: provides low-memory support for snt2cooc. | Release Note |
| 0.7.0 | 2011-12-30 | Download | A functional upgrade: provides native Microsoft Windows support (tested on Visual C++ 10.0, 32-bit) | Release Note |
| 0.6.3.1 | 2010-01-23 | Download | Minor code cleanup; downloads moved to SourceForge | Release Note |
| 0.6.3 | 2010-01-11 | Download | Memory optimization and bug fix | Release Note |
| 0.6.2 | 2009-12-07 | Download | Minor interface change to keep compatibility with Chaski 0.2.2 | Release Note |
| 0.6.1 | 2009-11-17 | Download | Unnecessary dependencies removed | Release Note |
| 0.6 | 2009-11-10 | Download | | |

Installation

To compile MGIZA++ you need the following packages installed:

  1. Berkeley DB (libdb)
  2. Berkeley DB++ (libdb++)
  3. Boost library (string). Versions 0.7.0 and later also require boost::thread, and therefore a staged or installed Boost library.

As of version 0.6.1, Berkeley DB is no longer required, but you still need the Boost library. After the dependencies are installed, go to the source directory and run:

  ./configure --prefix=${QMT_HOME}
  make
  make install

If you want to use MGIZA++ with Chaski, you need to add the environment variable QMT_HOME to your .bashrc.
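
As a minimal sketch, the lines added to .bashrc could look like the following. The path $HOME/mgiza is only a placeholder (an assumption, not a default): use whatever directory you passed to --prefix. The PATH line is an optional convenience and is not required by anything stated above.

```shell
# Hypothetical install prefix; replace with the directory given to --prefix.
export QMT_HOME="$HOME/mgiza"

# Optional (an assumption, not a requirement): put the installed binaries on PATH.
export PATH="$QMT_HOME/bin:$PATH"
```

Setting QMT_HOME before running ./configure also lets the --prefix=${QMT_HOME} argument shown above expand to the intended directory.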

For the Boost library, you can either download it from http://www.boost.org or install the header package of your Linux distribution.

Install with CMake

In addition to GNU autotools, MGIZA++ now supports CMake. Use

cmake .
make
make install

to build and install, on both Windows and Linux.

Compile on Windows

The code has been modified to provide native Windows support. You can now generate a Visual Studio solution and build it by invoking:

cmake -G "Visual Studio 10"
msbuild /p:Configuration=Release mgiza.sln

You will need Visual Studio and a staged Boost library; the environment variable BOOST_ROOT needs to be set to the staged Boost directory.

After compilation, all the binaries should be generated in the bin/Release directory.

To run the mgizapp binary, you need to copy w32\pthreadlib.dll to a system directory, such as C:\Windows.

Partial support for Win64

You can also compile a 64-bit Windows version of mgiza, but it is not stable: it almost always pops up an error after the program exits. That is, you can still get your job done, but you have to click through the error in order to continue. I am working on a fix but have not found the cause yet.

To build, check out the latest source from SVN:

svn co https://mgizapp.svn.sourceforge.net/svnroot/mgizapp
cd mgizapp
cmake -G "Visual Studio 10 Win64"
msbuild /p:Configuration=Release mgiza.sln

Update: using the following configuration may get you further:

svn co https://mgizapp.svn.sourceforge.net/svnroot/mgizapp
cd mgizapp
cmake -G "Visual Studio 10 Win64"
msbuild /p:Configuration=RelWithDebInfo mgiza.sln

For the 64-bit build you will need pthreadlib64.dll copied to your system directory.

Usage

The basic usage of MGIZA++ is easy, given that you know how to run GIZA++. MGIZA++ is compatible with GIZA++'s parameters, and you can run:

  ${QMT_HOME}/bin/mgiza  -ncpu 5 [ALL-YOUR-GIZA-PARAMETERS]

to tell mgiza to run five threads.

The alignment output of MGIZA++ is somewhat different from that of GIZA++: given n threads, the alignment output will be split into n files:
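
If you would rather match the thread count to the machine than hard-code it, the sketch below shows one way to do so. It assumes a system where getconf reports _NPROCESSORS_ONLN (common on Linux and macOS) and falls back to 1 otherwise. The variable name ncpu is only illustrative, and the script prints the command line rather than running it.

```shell
# Number of online CPU cores; fall back to 1 if it cannot be determined.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Dry run: show the command that would be used, with the placeholder
# parameters left exactly as in the usage line above.
echo "\${QMT_HOME}/bin/mgiza -ncpu ${ncpu} [ALL-YOUR-GIZA-PARAMETERS]"
```

According to the release table, version 0.7.2 and later can set the number of threads automatically, so this may only be useful with older versions or when you want explicit control.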

prefix.A3.final.part0
prefix.A3.final.part1
...
prefix.A3.final.part(n-1)

To combine the alignments you need to run:

${QMT_HOME}/scripts/merge_alignment.py ${prefix}.A3.final.part* > ${prefix}.A3.final
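
Before merging, it can be useful to confirm that every part file was actually written. The helper below is a hedged sketch: check_parts is a hypothetical function (not part of MGIZA++), and it assumes the part0 ... part(n-1) naming shown in the listing above.

```shell
# check_parts PREFIX NCPU
# Hypothetical helper: succeeds only if PREFIX.A3.final.part0 through
# PREFIX.A3.final.part(NCPU-1) all exist as regular files.
check_parts() {
    prefix=$1
    ncpu=$2
    i=0
    while [ "$i" -lt "$ncpu" ]; do
        if [ ! -f "${prefix}.A3.final.part${i}" ]; then
            echo "missing: ${prefix}.A3.final.part${i}" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

A typical use would be to call check_parts "${prefix}" 5 after a five-thread run and invoke merge_alignment.py only if it succeeds.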

For advanced usage please refer to the following “HOWTOs”

mgiza/overview.txt · Last modified: 2013/06/23 13:01 by edwardgao