MGIZA++ is a multi-threaded word alignment tool based on GIZA++. It extends GIZA++ in multiple ways:
Multi-threading
MGIZA++ makes efficient use of multi-core platforms. A quad-core machine typically sees about a three-fold speedup over single-threaded GIZA++.
Memory optimization
By eliminating duplicated tables, MGIZA++ uses considerably less memory than GIZA++.
Resume training
MGIZA++ can resume training from any stage. For example, you can re-use previously trained models and continue training directly from IBM Model 4 instead of starting all the way from Model 1.
Integrated with Chaski
MGIZA++ can be integrated into Chaski and run on clusters, which gives an even larger speedup.
Native Windows support
MGIZA++ can now be compiled in Visual Studio, providing native MS Windows support. The latest version is, however, not stable when compiled as 64-bit.
If MGIZA++ helps you, please cite the following paper in addition to the GIZA++ one:
Qin Gao, Stephan Vogel, “Parallel Implementations of Word Alignment Tool”, Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June, 2008
The latest version of MGIZA++ can be downloaded here:
| Version | Date | Link | Comment | Release Note |
|---|---|---|---|---|
| Version 0.7.3 | 2013-01-19 | Download | Several fixes for compilation issues | Release Note Fixing Boost Library, From Amittai |
| Version 0.7.2 | 2012-05-26 | Download | A functional upgrade: mgiza can now automatically set the number of threads. | Release Note Fixing Boost Library, From Amittai |
| Version 0.7.1 | 2012-05-26 | Download | A functional upgrade: provides low-memory support for snt2cooc. | Release Note |
| Version 0.7.0 | 2011-12-30 | Download | A functional upgrade: provides native Microsoft Windows support (tested on Visual C++ 10.0 32-bit) | Release Note |
| Version 0.6.3.1 | 2010-01-23 | Download | Minor code cleanup; downloads moved to SourceForge | Release Note |
| Version 0.6.3 | 2010-01-11 | Download | Memory optimization and bug fix | Release Note |
| Version 0.6.2 | 2009-12-07 | Download | Minor interface change to keep compatibility with Chaski 0.2.2 | Release Note |
| Version 0.6.1 | 2009-11-17 | Download | Unnecessary dependencies removed | Release Note |
| Version 0.6 | 2009-11-10 | Download | | |
To compile MGIZA++ you need the Boost library installed. As of version 0.6.1, Berkeley DB is no longer a dependency. Once the dependencies are installed, go to the source directory and run:
    ./configure --prefix=${QMT_HOME}
    make
    make install
If you want to use MGIZA++ with Chaski, you need to add the environment variable QMT_HOME to your .bashrc.
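A minimal sketch of such a `.bashrc` entry follows. The install prefix `$HOME/mgizapp` is a hypothetical path chosen for illustration; use whatever prefix you pass to `./configure`. The `PATH` line is an optional convenience and is not required by anything above.

```shell
# ~/.bashrc
# Assumption: $HOME/mgizapp is a hypothetical install prefix; substitute your own.
export QMT_HOME="$HOME/mgizapp"

# Optional: make the installed binaries callable without the full path.
export PATH="$QMT_HOME/bin:$PATH"
```

Since `./configure --prefix=${QMT_HOME}` reads the same variable, it needs to be set in the shell before you configure the build.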
For the Boost library, you can either download it from http://www.boost.org or install the header package of your Linux distribution.
In addition to GNU autotools, MGIZA++ now supports CMake. Use
    cmake .
    make
    make install
to build and install, on both Windows and Linux.
The code has been modified to provide native Windows support. You can now generate a Visual Studio solution and build it by invoking:
    cmake -G "Visual Studio 10" .
    msbuild /p:Configuration=Release mgiza.sln
You will need Visual Studio and a staged Boost library; the environment variable BOOST_ROOT needs to be set to the STAGED Boost directory.
After compilation, all the binaries should be generated in the bin/Release directory.
To run the mgizapp binary, you need to copy w32\pthreadlib.dll to a system directory, such as C:\Windows.
You can also compile a 64-bit Windows version of mgiza, but it is not stable: it almost always pops up an error AFTER the program exits. That is, you can still get your job done, but you have to click to dismiss the error in order to continue. I am working on a fix but have no clue yet.
To build, check out the latest source from SVN:
    svn co https://mgizapp.svn.sourceforge.net/svnroot/mgizapp
    cd mgizapp
    cmake -G "Visual Studio 10 Win64" .
    msbuild /p:Configuration=Release mgiza.sln
Update: using the following configuration may get you further:
    svn co https://mgizapp.svn.sourceforge.net/svnroot/mgizapp
    cd mgizapp
    cmake -G "Visual Studio 10 Win64" .
    msbuild /p:Configuration=RelWithDebInfo mgiza.sln
For the 64-bit build, you will need pthreadlib64.dll copied to your system directory.
The basic usage of MGIZA++ is easy, given that you know how to run GIZA++. MGIZA++ is compatible with GIZA++'s parameters, and you can run:
${QMT_HOME}/bin/mgiza -ncpu 5 [ALL-YOUR-GIZA-PARAMETERS]
to tell mgiza to run with five threads.
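If you prefer not to hard-code the thread count, the invocation can be wrapped in a small shell function. This is only a sketch under stated assumptions: `run_mgiza` is a hypothetical helper name, `QMT_HOME` is assumed to point at the install prefix, and `nproc` (GNU coreutils) is assumed to be available for counting cores. Only the `-ncpu` option shown above is used.

```shell
#!/bin/sh
# Sketch: run mgiza with one thread per available core.
# Assumptions: QMT_HOME is the install prefix; nproc (GNU coreutils) exists.
# run_mgiza is a hypothetical helper, not part of MGIZA++.
run_mgiza() {
    # Fall back to a single thread if nproc is unavailable.
    ncpu=$(nproc 2>/dev/null || echo 1)
    mgiza="${QMT_HOME}/bin/mgiza"
    if [ ! -x "$mgiza" ]; then
        echo "mgiza not found at $mgiza" >&2
        return 1
    fi
    # All remaining arguments are passed through as GIZA++ parameters.
    "$mgiza" -ncpu "$ncpu" "$@"
}
```

You would then call `run_mgiza [ALL-YOUR-GIZA-PARAMETERS]`. According to the release table, version 0.7.2 and later can set the number of threads automatically, so a wrapper like this is mainly useful for older versions or when you want the value to be explicit.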
The alignment output of MGIZA++ is somewhat different from that of GIZA++. With n threads, the alignment output is split into n files:
    prefix.A3.final.part0
    prefix.A3.final.part1
    ...
    prefix.A3.final.part(n-1)
To combine the alignments you need to run:
${QMT_HOME}/scripts/merge_alignment.py ${prefix}.A3.final.part* > ${prefix}.A3.final
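Because the glob `part*` merges whatever part files happen to exist, it can be worth checking that all n parts are present first. The sketch below does that using the file naming shown above. The function names `expected_parts` and `merge_parts` are hypothetical helpers, and the prefix is assumed to contain no whitespace.

```shell
#!/bin/sh
# Sketch: list the part files expected from an n-thread run
# (part0 .. part(n-1)) and merge only if every one of them exists.
# expected_parts and merge_parts are hypothetical helpers.

expected_parts() {   # usage: expected_parts PREFIX N
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "$1.A3.final.part$i"
        i=$((i + 1))
    done
}

merge_parts() {      # usage: merge_parts PREFIX N
    # Refuse to merge an incomplete set of part files.
    for f in $(expected_parts "$1" "$2"); do
        [ -f "$f" ] || { echo "missing part file: $f" >&2; return 1; }
    done
    # Assumption: PREFIX has no whitespace, so word splitting is safe here.
    "${QMT_HOME}/scripts/merge_alignment.py" $(expected_parts "$1" "$2") \
        > "$1.A3.final"
}
```

For the five-thread run above, `merge_parts "${prefix}" 5` would either write `${prefix}.A3.final` or report the first missing part.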
For advanced usage please refer to the following “HOWTOs”