PyTom: Classify aligned particles

Overview

The aim of classification is to group structurally different particles from a particle list into different bins. You may want to use classification directly after localization to discard obvious false positives or 'bad' particles, or you may want to classify subtomograms after alignment to gain insight into the possible conformational heterogeneity of your molecules of interest.

In PyTom, there are currently two different methods for subtomogram classification implemented: (i) Constrained Principal Component Analysis (CPCA) in conjunction with k-means clustering and (ii) multiple correlation optimization (MCO). Both methods and their usage are explained in the following.

CPCA-based classification using calculate_correlation_matrix.py and classifyCPCA.py

CPCA is explained in detail in the original publication (Foerster et al. 2008). Classification by CPCA consists of three major steps. Firstly, the constrained correlation matrix of all subtomograms is computed. This matrix contains all pairwise constrained correlation coefficients (CCCs) of the subtomograms; computing it is by far the most computationally demanding step, and hence this step has been parallelized. Secondly, the principal components of this matrix are computed, and the data are compressed using a specified number of eigenvectors. Thirdly, the reduced data are clustered using the k-means method.

The CCC matrix is computed using the script

mpirun --hostfile "pathToYourHostfile" -c "numberOfCPUs" pytom PathToPytom/classification/calculate_correlation_matrix.py -p "ParticleList" -m "MaskFile" -f "Lowpass" -b "Binning"
The arguments are the particle list (-p), a mask file (-m), the lowpass filter (-f), and the binning factor (-b). The script will generate a file called 'correlation_matrix.csv', which contains the CCC matrix in ASCII format.
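As a quick sanity check, the resulting matrix can be inspected with NumPy. The sketch below assumes 'correlation_matrix.csv' is a plain comma-separated square matrix (one row per subtomogram), as the filename suggests; here a tiny example matrix is written first so the snippet is self-contained:

```python
import numpy as np

# Illustrative stand-in for the file produced by
# calculate_correlation_matrix.py (layout is an assumption:
# a plain comma-separated square matrix, one row per subtomogram).
example = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
np.savetxt('correlation_matrix.csv', example, delimiter=',')

ccc = np.loadtxt('correlation_matrix.csv', delimiter=',')
assert ccc.shape[0] == ccc.shape[1]   # the CCC matrix is square
assert np.allclose(ccc, ccc.T)        # and symmetric
print('number of particles:', ccc.shape[0])
```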

The CCC matrix is further used for classification by the script classifyCPCA.py. This script computes the eigenvectors of the CCC matrix and projects the data on the first neig eigenvectors. Subsequently, these multidimensional vectors are clustered into nclass groups using the k-means algorithm. The usage is:

pytom PathToPytom/bin/classifyCPCA.py -p "ParticleList" -o "OutputParticleList" -c "CCC" -e "neig" -n "nclass" -a "Average"
In detail, the parameters of the script are the particle list (-p), the output particle list (-o), the CCC matrix file (-c), the number of eigenvectors neig (-e), the number of classes nclass (-n), and the filename for the class averages (-a). The output of the method is the ParticleList with assigned classes as well as the different class averages.
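Conceptually, the reduction and clustering performed by classifyCPCA.py can be sketched with NumPy and SciPy. This is a simplified illustration on a toy symmetric matrix, not PyTom's actual implementation; neig and nclass mirror the script's parameters:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy symmetric 'CCC' matrix for illustration (normally this would be
# loaded from the output of calculate_correlation_matrix.py).
rng = np.random.default_rng(0)
a = rng.random((10, 10))
ccc = (a + a.T) / 2.0

neig, nclass = 3, 2

# Eigen-decomposition of the symmetric matrix; keep the neig
# eigenvectors with the largest eigenvalues.
vals, vecs = np.linalg.eigh(ccc)
order = np.argsort(vals)[::-1]
coords = vecs[:, order[:neig]] * vals[order[:neig]]  # reduced data

# k-means clustering of the reduced coordinates into nclass groups.
centroids, labels = kmeans2(coords, nclass, minit='++')
print(labels)
```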

Multiple correlation optimization using mcoEXMX.py or mcoAC.py

Another option is to use a classification algorithm based on correlation to different class averages (multiple correlation optimization, MCO). The goal is to find a distribution of particles that optimises the correlation to a set of different references.

MCO comes in two different flavors: the simple version is an optimization by a local, gradient-like method (expectation maximization) akin to k-means classification (MCO-EXMX). Alternatively, stochastic elements (simulated annealing) can be incorporated into the classification procedure (MCO-AC) (Hrabe et al. 2012).

In short, the idea is to group the subtomograms such that they match best to a class average. In an iterative procedure the assignment of particles is changed such that the (constrained) correlation of the class average to its respective class members gets optimized (using the constrained correlation makes sure that the missing wedge is considered in the similarity measure). The simplest optimization algorithm is expectation maximization (EXMX), which is also the basis for subtomogram alignment. This algorithm converges rapidly, but it can end up in local minima. Therefore, we designed an 'outer loop' for performing EXMX multiple times in the AC algorithm. At the start of each iteration the class assignments are perturbed and then optimized using EXMX. The new assignments are accepted or rejected according to the so-called Metropolis criterion, which depends on a pseudo-temperature. Thus, MCO-AC is a (computationally much more demanding) extension of MCO-EXMX.
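The acceptance step follows the standard Metropolis criterion; a minimal sketch is shown below (the score convention, with higher scores being better, is an illustrative assumption, not PyTom's exact implementation):

```python
import math
import random

def metropolis_accept(old_score, new_score, temperature, rng=random):
    """Accept a perturbed class assignment if it improves the score,
    or otherwise with probability exp(delta / T), where T is the
    pseudo-temperature (higher score = better match)."""
    delta = new_score - old_score
    if delta >= 0:
        return True          # improving moves are always accepted
    if temperature <= 0:
        return False         # at T = 0 the scheme reduces to greedy EXMX
    return rng.random() < math.exp(delta / temperature)

# An improving move is always accepted ...
assert metropolis_accept(0.5, 0.7, temperature=1.0)
# ... and at zero pseudo-temperature a worsening move is always rejected.
assert not metropolis_accept(0.7, 0.5, temperature=0.0)
```

As the pseudo-temperature is lowered over the outer iterations, worsening moves become ever less likely, so the procedure gradually turns into plain EXMX.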

The methods are called using the scripts mcoEXMX.py or mcoAC.py like this:

mpirun --hostfile "pathToYourHostfile" -c "numberOfCPUs" pytom PathToPytom/bin/mcoAC.py -j class.xml
The job file 'class.xml' can be created using the PyTom user interface (see below). Make sure that all particles in the particle list have either the same class label (e.g., <Class Name="0"/>) or no class property at all. If more than one class label is present, the program will take that assignment as an a priori distribution and continue from there.
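If your particle list carries mixed labels from a previous run, you can reset them before starting. A sketch using the standard library's xml.etree follows; the XML structure here is a simplified assumption (real PyTom particle lists contain additional tags), but the <Class Name="..."/> element matches the form shown above:

```python
import xml.etree.ElementTree as ET

# Minimal example particle list with mixed class labels (simplified
# structure for illustration only).
xml_text = '''<ParticleList>
  <Particle Filename="p1.em"><Class Name="3"/></Particle>
  <Particle Filename="p2.em"><Class Name="7"/></Particle>
</ParticleList>'''

root = ET.fromstring(xml_text)
for cls in root.iter('Class'):
    cls.set('Name', '0')      # give every particle the same label

assert all(c.get('Name') == '0' for c in root.iter('Class'))
ET.ElementTree(root).write('particleListReset.xml')
```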

Parameters for MCO-EXMX and MCO-AC

Required parameters for classification jobs are:

Output of MCO-EXMX and MCO-AC

We start with the simpler output of MCO-EXMX. Inside the destination directory, you will find:

Output of MCO-AC in the destination directory:

MCO classification on the tutorial data

The tutorial shows how to apply both types of classification to the tutorial data: particles are classified directly after localization with MCO-EXMX (in postLocalizationClassification) and after alignment with MCO-AC. Please note that this order was chosen based on our standard workflow but can be swapped for individual adjustment.

Merging classes

It is a common problem of classification algorithms that they tend to ignore small classes. A solution to this problem is to oversample the class number in the classification: e.g., if ~4 classes are expected to be present in the dataset, one may initially group the dataset into ~20 classes, which increases the chances that a sparse class will be distinguished rather than fused into a more populated class. Many of the ~20 classes will then be essentially identical and can subsequently be merged again.

In order to merge classes by hierarchical clustering, you have to run the following commands.

  1. Average: Generate class averages with this script:
    $ ipytom
    from pytom.basic.structures import ParticleList
    pl = ParticleList()
    pl.fromXMLFile('bestAfter31Its.xml')
    pl.sortByClassLabel()
    pls = pl.splitByClass()
    for i in xrange(len(pls)):
        c = pls[i]
        print i, ' == classlabel : ', c[0].getClassName(), ' Class size: ' , len(c)
        c.average('classAverages/class' + str(i) + '.em')

    classAveragesList = ParticleList('classAverages')
    classAveragesList.loadDirectory()
    classAveragesList.toXMLFile('classAverageList.xml')
  2. CCC: Determine a correlation matrix between all class averages, which can then be used for hierarchical clustering. Copy the below into a script and run either in parallel or sequential.
    $ ipytom
    from pytom.cluster.correlationMatrixStructures import CorrelationMatrixJob
    from pytom.basic.structures import ParticleList,Mask,Particle
    from pytom.cluster.correlationMatrix import distributedCorrelationMatrix
    import os
    pl = ParticleList()
    pl.fromXMLFile('classAverageList.xml')

    mask=Mask('./mask.em')
    job = CorrelationMatrixJob(particleList=pl, mask=mask,
        resultMatrixName = 'correlationMatrix.em',applyWedge=False,
        binningFactor=2, lowestFrequency=0, highestFrequency=17)
    distributedCorrelationMatrix(job,False)
  3. CCC to ascii: Convert matrix from EM format to text file that allows import to R:
    $ ipytom
    from pytom_volume import read
    from pytom.tools.files import writeMatrix2RMatrix
    m = read('correlationMatrix.em')
    writeMatrix2RMatrix(m,'correlationMatrix-R.txt')
  4. Clustering: Perform hierarchical clustering using the following commands in scipy:
    $ ipytom
    from pytom.basic.files import read
    from pytom_volume import abs
    from pytom_numpy import vol2npy
    from scipy.cluster.hierarchy import linkage,dendrogram
    from matplotlib.pyplot import show

    matrix = abs(read('correlationMatrix.em')-1)
    classification = linkage(vol2npy(matrix))
    print classification
    dendrogram(classification)
    show()
    You should see a hierarchical tree that suggests which classes should be merged.
  5. Merge classes: Using the script below, you first split the particle list according to the previously defined (too fine) classes and subsequently merge selected classes (in this example classes 0, 2, and 3 into one class, classes 1 and 4 into another, and classes 5 and 6 into a third new class).
    $ ipytom
    from pytom.basic.structures import ParticleList
    pl = ParticleList()
    pl.fromXMLFile('bestAfter31Its.xml')
    pl.sortByClassLabel()
    pls = pl.splitByClass()

    proteinState1 = pls[0] + pls[2] + pls[3]
    proteinState1.average('proteinInState1')

    proteinState2 = pls[1] + pls[4]
    proteinState2.average('proteinInState2')

    outliers = pls[5] + pls[6]
    outliers.average('outliers')
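The cut through the dendrogram from step 4 can also be made programmatically with scipy's fcluster, so that the merged class labels are available in a script. The sketch below uses a toy condensed distance vector for four class averages; in practice you would reuse the linkage computed in step 4:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy condensed distance vector for 4 class averages: classes 0/1 are
# close, classes 2/3 are close, and the two pairs are far apart.
# (Entry order: d01, d02, d03, d12, d13, d23.)
dists = np.array([0.1, 0.9, 0.8, 0.85, 0.95, 0.05])

z = linkage(dists, method='average')
# Cut the tree so that classes closer than 0.5 end up merged.
merged = fcluster(z, t=0.5, criterion='distance')
print(merged)   # classes 0 and 1 share a label, as do classes 2 and 3
```

The resulting array assigns one merged-class label per original class, which you can use to drive the particle-list merging shown in step 5.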

Setting up a classification job using the web server

Setting up a classification job is essentially identical to setting up an alignment job. The differences are the parameters specifying the number of classes and so forth.

Create a classification job in the terminal with mcoEXMXJob.py and mcoACJob.py

Please note that we provide the mcoEXMXJob.py and mcoACJob.py scripts to set up classification jobs through the terminal. Run mcoEXMXJob.py --help and mcoACJob.py --help for details on the parameters.

Example

Coarse classification after localization

You can perform classification directly after localization in order to get rid of obvious false positives such as gold beads. In order to do so, we recommend setting up an MCO-EXMX classification job (without annealing).
Hence, you can directly classify your reconstructed subtomograms using the coarse alignment determined during localization. Please read below how to set up an MCO-EXMX job, which is essentially identical to setting up a coarse MCO-AC job. You will have to use the particle list you obtained after localization.

After MCO-EXMX, you want to select the good subtomograms. You can do so by running the following script:

$ ipytom
from pytom.basic.structures import ParticleList

pl = ParticleList()
#6 is the iteration number
#classifiedParticles.xml is the result
pl.fromXMLFile('./6/classifiedParticles.xml')
pl.sortByClassLabel()
pls = pl.splitByClass()

for cl in pls:
    className = cl[0].getClassName()
    cl.average('class' + str(className) + '.em')
    print className, ' contains ' , len(cl) , ' particles'

#choose the corresponding result class you want to use for further processing, e.g., classes 1,2,3,4
#the decision on which particles to take you will make based on the averages written out above
ribos = pls[1] + pls[2] + pls[3] + pls[4]

ribos.toXMLFile('filtered_pl.xml')
ribos.average('./filteredIniRef.em')