mRMR FAQ
1. Basic questions
Q1.1 What is mRMR?
Q1.2 What are "MID" and "MIQ"?
Q1.3 How to use mRMR?
Q1.4 Where to download the mRMR software and source codes?
Q1.5 What is the correct format of my input data?
Q1.6 How should I understand the results of mRMR?
Q1.7 How to cite/acknowledge mRMR?
Q1.8 License information?
Q1.9 Damage & other risks & disclaimer?
2. How to use the online version
Q2.1 Where is the online version of mRMR?
Q2.2 Is it true that the online version only considers the mutual information based mRMR?
Q2.3 What are the input/parameters of the online program?
Q2.4 What should be the input file and its format for the online version?
Q2.5 What is the meaning of the output of the online program?
3. How to use the C/C++ version
Q3.1 Where is the C/C++ version of mRMR?
Q3.2 Can I use the C/C++ version for Linux, Mac, Unix, or other *nix machines?
Q3.3 Can I use the C/C++ version on Windows machines?
Q3.4 Is it true that the C/C++ version only considers the mutual information based mRMR?
Q3.5 What are the input/parameters of the C/C++ program?
Q3.6 What should be the input file and its format for the C/C++ version?
Q3.7 What is the meaning of the output of the C/C++ program?
Q3.8 Are the C/C++ program and the online version the same?
Q3.9 Any help information available when I try to run the C/C++ version?
Q3.10 The C/C++ binary program hangs. Why?
4. How to use the Matlab version
Q4.1 Where is the Matlab version of mRMR?
Q4.2 Can I use the Matlab versions for Linux, Mac, Unix, Windows, or other machines?
Q4.3 Is it true that the released Matlab version only considers the mutual information based mRMR?
Q4.4 What are the input/parameters of the Matlab version?
Q4.5 What is the output of the Matlab version?
Q4.6 What are mrmr_mid_d, mrmr_miq_d, and mrmr_mibase_d?
Q4.7 Do the Matlab version and C/C++ version produce the same results?
5. Handle discrete/categorical and continuous variables
Q5.1 How does mRMR handle continuous variables?
Q5.2 My variables are continuous or have as many as 500/1000 discrete/categorical states. Can I use mRMR?
Q5.3 What is the best way to discretize the data?
Q5.4 Why does your README file mention "Results with the continuous variables are not as good as with Discrete variables"?
6. Other questions
Q6.1 A typo in Eq. (8) in your JBCB05 paper (and also the CSB03 paper)?
Q6.2 Can I ask questions and what is your contact?
1. Basic questions
Q1.1 What is mRMR?
A. It means minimum-Redundancy-Maximum-Relevance
feature/variable/attribute selection. The goal is to select a feature subset
that best characterizes the statistical property of a target classification
variable, subject to the constraint that these features are mutually as
dissimilar to each other as possible, but marginally as similar to the
classification variable as possible. We showed several different forms of mRMR,
where "relevance" and "redundancy" were defined using
mutual information, correlation, t-test/F-test, distances, etc.
Importantly, for mutual information, we showed
that the method to detect mRMR features also searches for a feature set of
which features jointly have the maximal statistical "dependency" on
the classification variable. This "dependency" term is defined using
a new form of the high-dimensional mutual information.
The mRMR method was first developed as a fast
and powerful feature "filter". We then also showed a method to
combine mRMR and "wrapper" selection methods. These methods have
produced promising results on a range of datasets in many different areas.
Q1.2 What are
"MID" and "MIQ"?
A. MID and MIQ represent the Mutual Information
Difference and Quotient schemes, respectively, to combine the relevance and
redundancy that are defined using Mutual Information (MI). They are the two
most used mRMR schemes.
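As an illustration only, the following Python sketch shows how the MID and MIQ
scores can drive a greedy, incremental selection on pre-discretized data. It is
not the released C/C++ or Matlab code. The function names (mutual_information,
mrmr), the use of the average mutual information with the already selected
features as the redundancy term, the small constant added to the MIQ
denominator, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr(features, target, k, scheme="MID"):
    """Greedy ranking of feature indexes; `features` is a list of columns."""
    relevance = [mutual_information(col, target) for col in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for j in remaining:
            if not selected:
                # First pick: plain maximum relevance.
                score = relevance[j]
            else:
                # Assumed redundancy: mean MI with the features picked so far.
                redundancy = sum(
                    mutual_information(features[j], features[s]) for s in selected
                ) / len(selected)
                if scheme == "MID":
                    score = relevance[j] - redundancy
                else:  # "MIQ"; 1e-12 is an assumed guard against division by zero
                    score = relevance[j] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 12 samples, 4 classes, 3 discrete features.
target = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
f0 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
f1 = list(f0)                                   # exact duplicate of f0
f2 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
features = [f0, f1, f2]

print(mrmr(features, target, 3, "MID"))  # [0, 2, 1]
print(mrmr(features, target, 3, "MIQ"))  # [0, 2, 1]
```

In this toy data the duplicated feature 1 is individually more relevant than
feature 2, yet feature 2 is selected second because it adds little redundancy
(compare Q1.6).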
Q1.3 How to use mRMR?
A. There are three ways. 1) Prepare your data
and run our online program at the web site http://research.janelia.org/peng/proj/mRMR.
2) Download the precompiled C/C++ version (binary) and run it on your own
machine. 3) Download the Matlab versions (binary plus source code) and run
them on your own machine. You can find the software download links at our web
site, too.
Q1.4 Where to download the
mRMR software and source codes?
A. You can download different versions at this
web site, or follow the links to download the Matlab versions. The Matlab
version contains all key source codes, including the mRMR algorithm (in Matlab)
and mutual information computation toolbox (in C/C++ and can be compiled as
Matlab mex functions).
Q1.5 What is the correct
format of my input data?
A. See the answers to the respective mRMR
versions below.
Q1.6 How should I understand
the results of mRMR? For example, I ran it on a small 3-variable data set &
it gave me an output like 2 1 3.
Does that mean 2 is the least statistical dependent & 3 is most
dependent? That is, 2 is the most
relevant & 3 is least relevant?
A. That means the first selected feature is 2,
and the last is 3.
That also means the combination of 2 and 1 is
better than the combination of 2 and 3.
That also means 2 is the best feature if you
only want one feature. And "2 and 1" is the best combination if you
want two features.
However, this does NOT mean 3 is the least
relevant or most dependent. If you select features by considering only the
relationship between each individual feature and the target class variable,
and not the relationships among the features themselves, you may find that 3
is more relevant than 1. Even so, the combination of 2 and 1 is better than
that of 2 and 3.
Q1.7 How to cite/acknowledge
mRMR?
A. We would appreciate it if you appropriately cite
the following papers:
[TPAMI05] Hanchuan Peng, Fuhui Long, and Chris
Ding, "Feature selection based on mutual information: criteria of
max-dependency, max-relevance, and min-redundancy," IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005.
This paper presents a theory of mutual
information based feature selection. Demonstrates the relationship of four
selection schemes: maximum dependency, mRMR, maximum relevance, and minimal
redundancy. Also gives the combination scheme of "mRMR + wrapper"
selection and mutual information estimation of continuous/hybrid variables.
[JBCB05] Chris Ding, and Hanchuan Peng,
"Minimum redundancy feature selection from microarray gene expression
data," Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2,
pp. 185-205, 2005.
This paper presents a comprehensive suite of
experimental results of mRMR for microarray gene selection on many different
conditions. It is an extended version of the CSB03 paper.
[CSB03] Chris Ding, and Hanchuan Peng,
"Minimum redundancy feature selection from microarray gene expression
data," Proc. 2nd IEEE Computational Systems Bioinformatics Conference (CSB
2003), pp. 523-528, Stanford, CA, Aug, 2003.
This paper presents the first set of mRMR
results and different definitions of relevance/redundancy terms.
[IS05] Hanchuan Peng, Chris Ding, and Fuhui
Long, "Minimum redundancy maximum relevance feature selection," IEEE
Intelligent Systems, Vol. 20, No. 6, pp. 70-71, November/December, 2005.
A short invited essay that introduces mRMR and
demonstrates the importance of reducing redundancy in feature selection.
[Bioinfo07] Jie Zhou, and Hanchuan Peng, "Automatic
recognition and annotation of gene expression patterns of fly embryos,"
Bioinformatics, Vol. 23, No. 5, pp. 589-596, 2007.
One application of mRMR in selecting good
wavelet image features.
Q1.8 License information?
A. The mRMR software packages can be downloaded
and used, subject to the following conditions: Software and source code
Copyright (C) 2000-2007 Written by Hanchuan Peng. These software packages are
copyright under the following conditions: Permission to use, copy, and modify
the software and their documentation is hereby granted to all academic and
not-for-profit institutions without fee, provided that the above copyright
notice and this permission notice appear in all copies of the software and
related documentation and our publications (TPAMI05, JBCB05, CSB03, etc.) are
appropriately cited. Permission to distribute the software or modified or
extended versions thereof on a not-for-profit basis is explicitly granted,
under the above conditions. However, the right to use this software by
companies or other for profit organizations, or in conjunction with for profit
activities, and the right to distribute the software or modified or extended
versions thereof for profit are NOT granted except by prior arrangement and
written consent of the copyright holders. For these purposes, downloads of the
source code constitute "use" and downloads of this source code by for
profit organizations and/or distribution to for profit institutions is
explicitly prohibited without the prior consent of the copyright holders. Use
of this source code constitutes an agreement not to criticize, in any way, the
code-writing style of the author, including any statements regarding the extent
of documentation and comments present. The software is provided
"AS-IS" and without warranty of any kind, expressed, implied or
otherwise, including without limitation, any warranty of merchantability or
fitness for a particular purpose. In no event shall the authors be liable for
any special, incidental, indirect or consequential damages of any kind, or any
damages whatsoever resulting from loss of use, data or profits, whether or not
advised of the possibility of damage, and on any theory of liability, arising
out of or in connection with the use or performance of these software packages.
Q1.9 Damage & other
risks & disclaimer?
A. See the detailed disclaimer and conditions in
the answer to Q1.8. In short, we will NOT be liable for any
damage of any kind, or any loss of data, resulting from your use of the
released software. You use it entirely at your own risk.
2. How to use the online
version
Q2.1 Where is the online
version of mRMR?
A. At the web site http://research.janelia.org/peng/proj/mRMR.
Q2.2 Is it true that the online
version only considers the mutual information based mRMR?
A. Yes. Since the mutual information based mRMR
produces the most promising results and is the most widely used, the online
program only uses mutual information to define relevance and redundancy of
features.
Q2.3 What are the
input/parameters of the online program?
A. You need an input file, of course (some
people just clicked "Submit job" without specifying anything…). You
also need to choose how the relevancy and redundancy terms should be combined
(i.e. MID or MIQ), how many features you want to select, the property of your
variables (categorical or continuous), and how you want to discretize your data
in case you have continuous data.
Q2.4 What should be the
input file and its format for the online version?
A. It should be a CSV (comma-separated values)
file, where each row is a sample and each column is a
variable/attribute/feature. Make sure your data is separated by commas, not
blank spaces or other characters! The first row must be the feature names, and
the first column must be the classes for samples. You may download a testing
example data set here, which is the microarray
data of lung cancer (7 classes). In this sample data set, each
variable/feature/column has been discretized into 3 states, encoded as
"-2", "0" and "2". You may use other integers
(such as -1, 0, 1) for the categorical/discrete states you define yourself,
but never use letters or combinations of digits and letters (such as
"10v"). Try not to use strange states such as 1001 or 10000, as the
program uses these values to guess a reasonable amount of memory
to allocate. For example, if each variable has only 5 states, then try to use
-2,-1,0,1,2, or 0,1,2,3,4, but NEVER use something like "-10000, -1000, 0,
1000, 10000"! (Note: The released version was only designed for
obviously meaningful inputs.) More examples can be found at the mRMR web site,
too.
Your data can contain continuous values, except
in the first column (which is the class variable) and the first row (which is
the header). In this case, you can ask mRMR to do the discretization for you.
See FAQ part 5.
If you have variables that are continuous or
have many categorical states (e.g. several hundred), you may want to read more
about how mRMR handles continuous variables. See FAQ part 5.
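Before submitting a pre-discretized file, the format rules above can be checked
with a few lines of Python. This is a hedged sketch, not part of the mRMR
release: the function name check_discrete_csv and the feature names gene_a and
gene_b are hypothetical, and the check only covers the already-discretized case
(header row, class in the first column, plain integers everywhere else).

```python
import csv
import io
import re

INTEGER = re.compile(r"-?\d+")  # plain integers only, e.g. -2, 0, 2


def check_discrete_csv(text):
    """Return a list of problems found in CSV text holding pre-discretized data."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) < 2:
        return ["need a header row plus at least one sample row"]
    header, samples = rows[0], rows[1:]
    problems = []
    for row_no, row in enumerate(samples, start=2):
        if len(row) != len(header):
            # Typical cause: values separated by spaces or tabs instead of commas.
            problems.append(
                f"row {row_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if not INTEGER.fullmatch(value):
                problems.append(
                    f"row {row_no}, column {name!r}: {value!r} is not a plain integer"
                )
    return problems


good = "class,gene_a,gene_b\n1,-2,0\n2,0,2\n1,2,-2\n"
bad = "class,gene_a,gene_b\n1,10v,0\n2 0 2\n"

print(check_discrete_csv(good))
for problem in check_discrete_csv(bad):
    print(problem)
```

The second sample text triggers both kinds of complaint: a state containing a
letter ("10v") and a row separated by blank spaces instead of commas.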
Q2.5 What is the meaning of
the output of the online program?
A. The meanings of most parts of the online
program's output are intuitive: it automatically compares the features
selected using the conventional maximum relevance (MaxRel) method and mRMR.
Suppose you ask the program to select 50 features for you; you can then also
truncate the results and use the first 20 or 30 features. You can also test the
classification accuracy using the first K features, where K=1,…,50 in this
case. In this way, you can see how many features you need to reach a
satisfactory cross-validation classification accuracy. This method was
used in our papers, too.
The first column is the order of features
selected. The second column is the actual indexes of the features. The third
column includes the respective feature names extracted from your input file.
The last column is just the best score in the
process to select the *current* best feature. Indeed, it is the value of
"relevancy - (or /) redundancy" for MID (or MIQ) for the current
selected feature. Because all selected features will be used together for
classification, this score does not indicate anything about classification
performance, so it is NOT important for you to use.
3. How to use the C/C++
version
Q3.1 Where is the C/C++
version of mRMR?
A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR.
Q3.2 Can I use the C/C++
version for Linux, Mac, Unix, or other *nix machines?
A. Yes. There are precompiled versions. If you
cannot find one for your machine, send me an email and I will try to compile one
for you if possible.
Q3.3 Can I use the C/C++
version on Windows machines?
A. Probably yes if you use Cygwin, but this is
untested. Or you may want to simply consider using the Matlab version or the
online version.
Q3.4 Is it true that the
C/C++ version only considers the mutual information based mRMR?
A. Yes. Since the mutual information based mRMR
produces the most promising results and is the most widely used, the released
C/C++ program only uses mutual information to define relevance and redundancy
of features.
Q3.5 What are the
input/parameters of the C/C++ program?
A. You need an input file with the same format
explained in Q2.4. You also need to choose how the
relevancy and redundancy terms should be combined (i.e. MID or MIQ), how many
features you want to select, the property of your variables (categorical or
continuous), and how you want to discretize your data in case you have
continuous data. Just type "mrmr" or the appropriate program name for
the help information.
Q3.6 What should be the
input file and its format for the C/C++ version?
A. The same as the online version. See Q2.4.
Q3.7 What is the meaning of
the output of the C/C++ program?
A. The same as the online version. See Q2.5.
Q3.8 Are the C/C++ program
and the online version the same?
A. Yes. The online program just provides you a convenient
way to run mRMR on relatively small datasets. If you have big datasets, try to
use the C/C++ version or the Matlab version.
Q3.9 Any help information
available when I try to run the C/C++ version?
A. Just run "mrmr" (or other
appropriate program name if you get a special one from me) and it will print
the default help information on how to run it. You can also see some examples if
you use the web-based online program, which displays the command used at
the top of the result page.
Q3.10 The C/C++ binary
program hangs. Why?
A. This program does NOT hang unless you feed it
an inappropriate input file. For example, some people use variables with
hundreds of categorical states or continuous values, which cannot yield a
meaningful mutual information estimation in many cases. The released mRMR
binary versions use mutual information for discrete variables (if you are
interested in mutual information estimation for continuous variables, you can
find the formula in the TPAMI05 paper).
If you have continuous data with a big dynamic
range (say, from 1 to 1000), the binary version of the mRMR program treats each
variable as one with 1000 categories, and thus the computation of mutual
information takes a long time to run. That is why the program appears to "hang".
I suggest you pre-threshold (discretize) your
data in your own favorite way, in case you don't want to set the
threshold as mean+/- std.
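A quick way to spot inputs that are likely to make the program appear to hang
is to count the distinct states in each column before running it. The Python
sketch below is only an illustration: the helper names flag_wide_features and
relabel_compact are hypothetical, and the default limit of 5 states simply
follows the rule of thumb given in Q5.3.

```python
def flag_wide_features(columns, max_states=5):
    """Return the names of columns with more than `max_states` distinct values."""
    return [
        name for name, values in columns.items()
        if len(set(values)) > max_states
    ]


def relabel_compact(values):
    """Map the distinct states of one column to 0, 1, 2, ... in sorted order."""
    codes = {state: i for i, state in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical columns: one already discretized, one with a 1..1000 range.
columns = {
    "gene_a": [-1, 0, 1, 0],
    "gene_b": list(range(1, 1001)),
}

print(flag_wide_features(columns))              # ['gene_b']
print(relabel_compact([-10000, 0, 10000, 0]))   # [0, 1, 2, 1]
```

Columns reported by flag_wide_features are candidates for discretization (see
FAQ part 5). relabel_compact addresses the separate issue from Q2.4 of widely
spaced state codes such as -10000 and 10000.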
4. How to use the Matlab
version
Q4.1 Where is the Matlab
version of mRMR?
A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR,
where you will find the Matlab versions released to MatlabCentral.
For most commonly used machines, i.e. Linux,
Mac, and Windows, the simplest way is to download the mRMR Matlab version with
the precompiled mutual information toolbox. Or, you can download the mRMR codes
and the mutual information toolbox separately. The mutual information toolbox
was also written in C/C++ and can be recompiled as mex functions for Matlab
running on other platforms. These codes are essentially similar to the mutual
information computation in the C/C++ version of mRMR.
Of course, you will find the mutual information
toolbox useful for purposes other than mRMR. For the mRMR case, you will need
to set the Matlab path to this toolbox.
Q4.2 Can I use the Matlab
versions for Linux, Mac, Unix, Windows, or other machines?
A. Yes.
Q4.3 Is it true that the
released Matlab version only considers the mutual information based mRMR?
A. Yes.
Q4.4 What are the
input/parameters of the Matlab version?
A. Three inputs: D, F, and K. D is an m*n array
of your candidate features, where each row is a sample and each column is a
feature/attribute/variable. F is an m*1 vector, specifying the classification
variable (i.e. the class information for each of the m samples). K is the
maximal number of features you want to select.
D must be pre-discretized as a categorical
array, i.e. containing only integers. F must be categorical, indicating the
class information.
Q4.5 What is the output of
the Matlab version?
A. The indexes of the first K features selected.
This corresponds to the second column produced in the C/C++ version.
Q4.6 What are mrmr_mid_d,
mrmr_miq_d, and mrmr_mibase_d?
A. The first, mrmr_mid_d, is MID. The second,
mrmr_miq_d, is MIQ. The third, mrmr_mibase_d, is maximum relevance selection
(for comparison). See Q1.2 for more explanations. You can
also read our papers for a comparison of MID and MIQ.
Q4.7 Do the Matlab version
and C/C++ version produce the same results?
A. Theoretically yes. In practice there may be
slight differences in some cases, as they use different floating-point
precisions (i.e. one uses double and the other uses float). However, we have not
observed any dramatic differences between these two versions.
5. Handle
discrete/categorical and continuous variables
Q5.1 How does mRMR handle
continuous variables?
A. mRMR is a framework in which the relevance and
redundancy terms are combined. Mutual information, which is used most of
the time, is a useful way to define these two terms, but other options
exist. There are three ways for mRMR to handle continuous variables.
(1) Use t-test / F-test (bi-class/multiclass)
for relevance measure and the correlation among variables as redundancy
measure. Other scores such as distances can also be considered. See the CSB03
& JBCB05 papers for details.
(2) Use mutual information of discrete
variables. This requires first discretizing the variables/features. We have
chosen to discretize them using mean+/-alpha*std (alpha=1 or 0 or 2 or 0.5). The
choice of alpha will have some influence on the actual features selected (more
precisely, on the ordering of the features you select; if you select several
more, you may find that many of them are the same, although possibly in a
different order). This is actually a very robust way to select features. See
the TPAMI05, CSB03, JBCB05 papers for details.
(3) Use mutual information for continuous
variables. The mutual information can be estimated using Parzen windows. See
the TPAMI05 paper for the formulas. The Parzen-window computation is more
expensive than the discrete version of mutual information computation.
See Q1.7 for more
information on these papers.
Q5.2 My variables are
continuous or have as many as 500/1000 discrete/categorical states. Can I use
mRMR?
A. You can either use mRMR for continuous
variables (based on correlation, Parzen-windows mutual information estimation,
etc.) or first discretize and use the mRMR for discrete variables.
Q5.3 What is the best way to
discretize the data?
A. In our experiments we have discretized data
based on their mean values and standard deviations. We thresholded at mean ± alpha*std,
where the alpha value usually ranges from 0.5 to 2. Typically 2, 3, or no more
than 5 states for each variable will produce satisfactory results that are
quite robust, too. But you need to decide what would be the most
meaningful way to discretize your data.
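As a concrete illustration of thresholding at mean ± alpha*std, here is a
minimal Python sketch that maps one continuous variable to the three states
-1, 0, and 1. The function name discretize, the choice of state codes, and the
use of the population standard deviation are assumptions made for this example;
the released programs may differ in these details.

```python
import statistics


def discretize(values, alpha=1.0):
    """Map each value to -1, 0, or 1 using thresholds at mean -/+ alpha*std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std (assumed convention)
    low, high = mean - alpha * std, mean + alpha * std
    return [-1 if v < low else (1 if v > high else 0) for v in values]


# Hypothetical measurements for one variable across five samples.
values = [-5.0, 2.0, 3.0, 4.0, 11.0]

print(discretize(values, alpha=1.0))  # [-1, 0, 0, 0, 1]
print(discretize(values, alpha=0.0))  # [-1, -1, 0, 1, 1]
```

With alpha=0 both thresholds collapse onto the mean, so the variable becomes
essentially two-state (below or above the mean); larger alpha values push more
samples into the middle state.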
Q5.4 Why does your README file
mention "Results with the continuous variables are not as good as with
Discrete variables"?
A. We showed some comparison results in our
papers (see Q1.7) on both discrete and continuous cases
using the same datasets; typically the discretized results are better. There
are several reasons, e.g. discretization will often lead to a more robust classification.
6. Other questions
Q6.1 A typo in Eq. (8) in
your JBCB05 paper (and also the CSB03 paper)?
A. Yes, there is a typo in the JBCB05 paper (and
also the CSB03 paper): in Eq. (8) on page 189, the term (gk-g) should be
squared, i.e. (gk-g)^2.
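For readers who want to compute an F-test relevance score themselves, here is a
minimal Python sketch of the textbook one-way ANOVA F-statistic, in which the
between-class term uses the squared difference (gk-g)^2. It is an illustration
under that assumption, not a transcription of Eq. (8) from the paper; the
function name f_statistic and the toy values are made up for the example.

```python
def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one variable against class labels."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n                      # g
    means = {c: sum(g) / len(g) for c, g in groups.items()}   # gk per class
    # Between-class variance: note the SQUARED term (gk - g)^2.
    between = sum(
        len(g) * (means[c] - grand_mean) ** 2 for c, g in groups.items()
    ) / (k - 1)
    # Pooled within-class variance.
    within = sum(
        (v - means[c]) ** 2 for c, g in groups.items() for v in g
    ) / (n - k)
    return between / within


# Hypothetical expression values for one gene in two classes.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

print(f_statistic(values, labels))  # 13.5
```

Without the square, positive and negative deviations of the class means from
the overall mean would cancel out, so the score could not measure
between-class separation.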
Q6.2 Can I ask questions and
what is your contact?
A. Yes, you can ask questions if you cannot find
answers above. My contact is Hanchuan (dot) Peng (at) gmail (dot) com.