Nowadays, in many highly secured places (banks, government offices, etc.) we can see face verification systems. The critical job of such a system is to confirm whether a person is an authenticated employee. Ever wondered how the system works in the background? Siamese networks power this kind of verification. The application is not confined to face verification; it also extends to signature verification, comparing whether two signatures come from the same customer.
Let’s unravel the working principle behind Siamese neural networks, following the original research paper.
What is one-shot learning? In a usual image classification task, the data pipeline involves gathering many images per category and labelling each one. A CNN is then trained on this data to learn the image features needed for classification. But this setting is not suitable for face/signature verification.
It is not practical to gather multiple images of every employee and label each set to identify the person. Instead, the model sees only one example of each new class before prediction; at inference time, another sample is fed in and the model must decide whether it matches. This scenario is referred to as one-shot learning. It differs from zero-shot learning, where the model never sees any training instance of the target class at all.
Siamese Networks: These employ a unique structure to rank similarity between inputs. Once the network has been tuned, we can apply it not just to new data but to entirely new classes from unknown distributions.
Deep Siamese Networks for Image Verification: The name is derived from the Siamese twins. A Siamese network consists of twin networks that accept distinct inputs and are joined by a metric function at the top, which compares the output features from both networks for similarity. The learnable parameters (weights & biases) are tied between the two networks.
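The weight-tying idea can be sketched in a few lines. This is a minimal illustration, not the paper's architecture: the shared sub-network is assumed to be a single dense layer with ReLU, and the names (`W`, `b`, `embed`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # shared weights: 4-dim input -> 3-dim embedding
b = np.zeros(3)                   # shared bias

def embed(x):
    """One twin: both inputs pass through the SAME parameters."""
    return np.maximum(0.0, x @ W + b)

x1 = rng.standard_normal(4)
x2 = x1.copy()                    # identical inputs...
h1, h2 = embed(x1), embed(x2)
assert np.allclose(h1, h2)        # ...yield identical embeddings (tied weights)
```

Because there is only one set of parameters, the two "twins" are really the same function applied twice, which is what guarantees symmetric treatment of the pair.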
- If the two input images are almost similar, the feature representations from the two networks will be close, because both networks share the same parameters. Initially (2005), a contrastive energy function with dual terms was used: one term decreases the energy of like pairs, the other increases the energy of unlike pairs.
- But as per this research paper, a weighted L1 distance between the twin feature vectors h1 and h2, followed by a sigmoid activation, is used instead. Accordingly, cross-entropy is the loss function used for training the network.
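A sketch of this prediction head, assuming fixed distance weights `alpha` (in the paper these weights are learned jointly with the network):

```python
import numpy as np

def similarity(h1, h2, alpha):
    # weighted L1 distance between the twin features, squashed by a sigmoid
    return 1.0 / (1.0 + np.exp(-np.sum(alpha * np.abs(h1 - h2))))

h1 = np.array([0.2, 0.9, 0.4])
h2 = np.array([0.2, 0.9, 0.4])
alpha = np.array([0.5, -1.0, 2.0])
print(similarity(h1, h2, alpha))  # identical features -> sigmoid(0) = 0.5
```

Note that identical features give a score of exactly 0.5, the sigmoid of zero; the learned weights shift this so that the score behaves as a probability that the pair belongs to the same class.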
- The best performing model uses multiple convolutional layers before the fully connected layers and the top-level energy function.
- The units in the final convolutional layer are flattened into a single vector. This is followed by a fully connected layer and then one more layer that computes the induced distance metric between the twin embeddings, which is fed to a single sigmoidal output unit.
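The steps above (flatten, shared fully connected layer, weighted L1 distance, sigmoid) can be sketched end to end. The convolutional stack is omitted and replaced with stand-in feature maps; all shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_fc = rng.standard_normal((8, 5))   # shared fully connected layer
alpha = rng.standard_normal(5)       # weights of the induced L1 distance

def head(feat):
    # flatten the final conv feature maps, then apply the shared FC + ReLU
    return np.maximum(0.0, feat.reshape(-1) @ W_fc)

def verify(feat1, feat2):
    h1, h2 = head(feat1), head(feat2)
    # weighted L1 distance between twins -> single sigmoidal output
    return 1.0 / (1.0 + np.exp(-np.sum(alpha * np.abs(h1 - h2))))

f = rng.standard_normal((2, 4))      # stand-in conv features (2x4 map)
p = verify(f, f)
assert np.isclose(p, 0.5)            # same input -> zero distance -> p = 0.5
```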
Learning & Loss function: For two inputs x1 and x2, the label is defined as y(x1, x2) = 1 when the inputs belong to the same class, and 0 otherwise.
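With this label convention, the cross-entropy objective on the sigmoid output p is the standard binary cross-entropy, sketched here for a single pair:

```python
import numpy as np

def bce(p, y):
    # binary cross-entropy: y = 1 for a same-class pair, 0 otherwise
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(bce(0.9, 1), 4))  # confident and correct "same"  -> 0.1054
print(round(bce(0.9, 0), 4))  # confident but wrong           -> 2.3026
```

Minimizing this loss pushes p toward 1 for same-class pairs and toward 0 for different-class pairs, which is exactly the verification behaviour we want at inference time.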