Research at ExB

Artificial Intelligence makes many things possible. We’re making them happen.

AI is a large and varied research topic. At ExB, we invest almost a third of our budget in research and development. Our focus is on applied research, especially machine learning (ML) and deep learning (DL), in areas such as natural language processing (NLP) and image processing for practical purposes. Our goal is to enable cost-effective, high-quality applications. To do that, we often have to combine different AI techniques in new and unique ways.

Our Research in numbers

  • 9

    PhDs

  • 30+

    Patents

  • 50+

    Publications

Applied Research

That one step further means making it practical

For us, doing research means avoiding complex theoretical promises about AI and instead looking towards solving real-world problems. This also entails considering how a solution can benefit customers the most.

For example, academic NLP usually concentrates exclusively on clean running text, such as news articles. The reality is often completely different: it involves processing bills, insurance claims, complaints, surveys, Excel tables, PowerPoint slides, and documents with little linguistic information but a lot of meaningful visual text-block placement. The most common use case is extracting information from tables or other semi-structured data. Extracting information from pure text with proper sentences, phrases, or other linguistic structures is actually quite rare.

As a result, very advanced techniques common for NLP (such as high-end parsing solutions) often have little to no influence on the achieved scores. On the other hand, taking visual placement of text blocks on the original document into account has a tremendous influence. For us, this means that we must go beyond these well-trodden paths and combine multiple data modalities in unique ways, such as for visual document analysis.

Our approach

From pure theory to stable applications

Working at ExB as a researcher means keeping track of a rapidly evolving scientific field, but also constantly verifying ideas from that field and going one step further to adapt them to practical purposes.

Applied research as a methodical work process

Specifically, this means analyzing data, reading the literature on the state of the art, selecting the most promising experiments, and reproducing them. Going one step further means verifying their effectiveness on real data and reworking them together with our software engineers so that, for example, they always stay within given compute and memory limits.

The result is a valuable software solution with a properly documented approach and detailed measurements on dozens of different data sets, all delivered as one integrated product: the Cognitive Workbench.

Comparability and transparency

Every single AI training is subject to five-fold cross-validation. This means that four fifths of the data are used for training, while the remaining fifth stays unknown to the AI and is later used to evaluate its ability to generalize to unseen data. This is repeated for each possible fold. Finally, the model is trained on all available data to produce the productive, deployable model.
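The protocol above can be sketched in a few lines. This is a minimal, generic illustration using scikit-learn with placeholder data and a placeholder classifier, not the Cognitive Workbench's actual training pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# placeholder data standing in for a real labeled training set
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])                  # train on four fifths
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out fifth

print(np.mean(scores))  # the cross-validated estimate of generalization

# finally, train on all available data for the deployable model
final_model = LogisticRegression().fit(X, y)
```

The reported score comes only from held-out folds, while the deployed model still benefits from every labeled example.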

Each training results in a trained model. This is a resource that can be packaged with all the necessary runtime components and exported as a Docker image, which can then be installed anywhere and used on any amount of data (e.g. via a REST API). The trained model also contains a lot of additional information collected during training: the scores achieved during cross-validation, information about the training data, who performed the training, and so on.
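To make the idea concrete, here is a hypothetical sketch of the kind of metadata a trained model could carry alongside its runtime components. The field names and values are illustrative assumptions, not the Cognitive Workbench's actual schema:

```python
import datetime
import json

# illustrative metadata bundled with a trained model artifact
model_metadata = {
    "model_id": "example-model-001",                       # hypothetical identifier
    "trained_by": "jane.doe",                              # who performed the training
    "trained_at": datetime.date(2019, 1, 15).isoformat(),
    "training_data": {"name": "claims-sample", "n_documents": 12000},
    "cross_validation": {
        "folds": 5,
        "f1_per_fold": [0.91, 0.90, 0.92, 0.89, 0.91],     # example scores
    },
}

# serialized and shipped inside the exported image next to the model weights
print(json.dumps(model_metadata, indent=2))
```

Keeping this record with the model is what makes later inspection and comparison of trainings possible.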

This makes it possible to inspect any trained model, compare it with other trainings on the same data, or export and deploy it for production use.

Supervised and unsupervised learning

The solution often lies in clever combinations

Way back in 2009, before the mainstream caught up, we started to combine unsupervised training with supervised training. This leverages the knowledge hidden in large, unlabeled data sets and makes it possible to learn more from labeled data sets, which are usually small.

Unsupervised training on large data sets

Unsupervised training may run against a full web crawl consisting of hundreds of millions of sentences. This produces a language model: something that encodes the general usage of words, sentences, etc. in the language in question. This language model is independent of what it will later be used for. It encodes a lot of linguistic knowledge, such as word similarity (“house” and “building” are semantically similar), the syntax of words (“running” and “tidying” have a similar sentence function), ambiguity, etc. The unsupervised training process derives all of this information by itself from the data. Imagine seeing many sentences with contexts like “the house was built in” or “the building was built in”. Even without any knowledge of English, anyone would be able to deduce that “house” and “building” appear to be interchangeable.
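The “house”/“building” intuition can be demonstrated with a toy sketch: words that appear in the same contexts end up with similar context-count vectors. The five-sentence corpus and the simple counting scheme below are illustrative only, far simpler than any real language-model training:

```python
from collections import Counter

# a tiny illustrative corpus
sentences = [
    "the house was built in 1900",
    "the building was built in 1950",
    "the house was sold",
    "the building was sold",
    "the cat sat on the mat",
]

def context_counts(word, window=2):
    """Count the words appearing within a small window around `word`."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb)

house, building, cat = (context_counts(w) for w in ("house", "building", "cat"))
# shared contexts make "house" and "building" far more similar than "house" and "cat"
print(cosine(house, building), cosine(house, cat))
```

Real systems replace the raw counts with learned dense vectors, but the underlying signal (shared contexts imply similar meaning) is the same.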

Unsupervised learning has evolved a lot in the past 10 years. It went from complicated methods based on co-occurrence counting, with separate models for semantic and syntactic information, to the current state of the art: neural-network-based, fully contextualized, character-level language models that assign an individual vector to each word.

This vector has many properties, such as being similar (but not necessarily equal) to the vector of the same word in a slightly different context, or being different for the same word in a completely different context. The character-based approach makes it possible to assign good vectors even to words that were not seen during unsupervised training (so-called out-of-vocabulary words). It also means that words with OCR errors in them will get sensible vectors most of the time. These modern methods are simpler by design, but much more demanding in terms of required compute power.
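The character-level idea can be sketched in the style of subword-embedding models: a word vector is built from the vectors of its character n-grams, so an unseen or OCR-damaged word still gets a sensible representation. The random n-gram vectors and the hashing trick below are stand-ins for what a real model would learn:

```python
import zlib

import numpy as np

DIM, BUCKETS = 64, 1000
rng = np.random.default_rng(0)
ngram_vectors = rng.normal(size=(BUCKETS, DIM))  # stands in for learned n-gram vectors

def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def bucket(gram):
    """Deterministic hashing trick mapping an n-gram to a vector slot."""
    return zlib.crc32(gram.encode()) % BUCKETS

def word_vector(word):
    """A word vector as the mean of its character n-gram vectors."""
    return np.mean([ngram_vectors[bucket(g)] for g in char_ngrams(word)], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# an OCR error ("bu1lding") shares most character n-grams with the clean
# word, so their vectors stay close compared to an unrelated word
print(cosine(word_vector("building"), word_vector("bu1lding")))
print(cosine(word_vector("building"), word_vector("zebra")))
```

Because the vector is assembled from subword pieces, no word is ever truly out of vocabulary.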

As of 2019, training a good language model requires at least a small supercomputer with four Tesla GPUs and takes a couple of weeks.

Supervised training

Supervised training is the task of finding a model that reproduces a given set of training examples and generalizes to a separate set of test data that was not seen during training. An additional validation data set is often used in between for tuning the hyperparameters of the supervised training.
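The train/validation/test protocol can be sketched as follows. This is a generic scikit-learn illustration with placeholder data; the 80/10/10 split and the hyperparameter grid are common conventions, not fixed rules:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# hold out 20%, then split the held-out part evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# tune a hyperparameter (regularization strength C) on the validation set
best_score, best_c = -1.0, None
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_c = score, c

# report generalization on test data that influenced neither training nor tuning
final = LogisticRegression(C=best_c).fit(X_train, y_train)
print(final.score(X_test, y_test))
```

Keeping tuning decisions away from the test set is what makes the final score an honest estimate of generalization.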

Traditionally, algorithms like support vector machines (SVM) or conditional random fields (CRF) were used in this area. Recently, though, supervised training is usually implemented with neural network solutions using components like convolutional neural networks (CNN) or long short-term memory (LSTM) networks.

ExB has developed its own architecture for separating feature generation (designed or pre-learned) from the actual machine learning, so that the machine learning algorithm can be easily exchanged. At the same time, feature generation can also be easily expanded without affecting the machine learning layer in any (negative) way.
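The separation described above can be sketched with two small interfaces, so either side can be swapped without touching the other. All class and function names here are hypothetical illustrations, not ExB's actual architecture:

```python
from typing import Protocol, Sequence

class FeatureGenerator(Protocol):
    """Anything that turns raw text into a feature vector."""
    def features(self, text: str) -> list[float]: ...

class LengthFeatures:
    """A designed (hand-crafted) feature generator: simple document statistics."""
    def features(self, text: str) -> list[float]:
        tokens = text.split()
        return [float(len(tokens)), float(sum(map(len, tokens)))]

class Classifier(Protocol):
    """Any exchangeable learner (SVM, CRF, neural net, ...)."""
    def fit(self, X: Sequence[list[float]], y: Sequence[int]) -> None: ...
    def predict(self, X: Sequence[list[float]]) -> list[int]: ...

class MajorityClassifier:
    """A trivial stand-in learner: always predicts the most frequent label."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
    def predict(self, X):
        return [self.label for _ in X]

def train(generator: FeatureGenerator, clf: Classifier, texts, labels):
    """Wire any feature generator to any classifier through the interfaces."""
    clf.fit([generator.features(t) for t in texts], labels)
    return clf

# usage: swap LengthFeatures or MajorityClassifier for any other implementation
gen = LengthFeatures()
clf = train(gen, MajorityClassifier(), ["a b", "a b c", "x"], [1, 1, 0])
print(clf.predict([gen.features("a new document")]))
```

Because both sides only depend on the interfaces, new feature generators and new learners can be added independently.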

Top performance at any competition

Our technology is world-leading. This has been proven at many challenges and by the corresponding awards.

ISBI 2016

Best score in lesion segmentation and melanoma detection (ISIC Challenge).

ISBI 2016

Best company at the Camelyon Challenge (“Cancer Metastasis Detection in Lymph Nodes”).

MICCAI 2015

Best company for identification of cancer cells on histological images (GlaS Challenge).

SemEval 2015

Second place for English in semantic text similarity, and first place (by a large margin) for Spanish with a model that was also able to learn from the English data.

MultiLing 2015

Best company in multi-document text summarization in 38 languages, third place overall.

BioCreative 2015

Second-best company in “Medical Text Mining” for drugs, diseases and their interactions.

GermEval 2014

First and second place overall, and best named entity recognizer for German.

Contact us

We take time for your questions.
And we would be happy to show you a product demo.

Our industry solutions

The Cognitive Workbench in insurance:
Process business mail more reliably.

To date, less than half of the incoming business mail at insurance companies can be processed automatically. With the text mining of the Cognitive Workbench, you can reverse this trend. Thanks to significantly higher recognition quality – e.g. for damage reports, certificates of incapacity for work, medical letters, and customer letters – you can relieve your clerks and reduce your operational costs.