Artificial Intelligence makes many things possible. We’re making them happen.
Our Research in numbers
For example, in NLP in academic settings the usual approach is to concentrate exclusively on pure texts, such as news articles. The reality is often completely different, involving processing of bills, insurance claims, complaints, surveys, Excel tables, PowerPoint slides and documents with little linguistic information, but a lot of meaningful visual text block placement. The most common use case is the extraction of information from tables or other such semi-structured data. Extraction of information from pure texts with proper sentences, phrases or other linguistic structures is actually quite rare.
As a result, very advanced techniques common for NLP (such as high-end parsing solutions) often have little to no influence on the achieved scores. On the other hand, taking visual placement of text blocks on the original document into account has a tremendous influence. For us, this means that we must go beyond these well-trodden paths and combine multiple data modalities in unique ways, such as for visual document analysis.
Applied research as a methodical work process
Specifically, this means analyzing data, reading literature about the state of the art, selecting the most promising experiments, and reproducing them. Going one step further means verifying their effectiveness on real data and reworking them together with our software engineers so that they always stay within certain compute and space limits, for example.
The result is a valuable software solution with a properly documented approach and detailed measurements on dozens of different data sets, all delivered as one integrated product: the Cognitive Workbench.
Comparability and transparency
Every single AI training is subject to five-fold cross-validation. The data is split into five parts: four fifths are used for training, while the remaining fifth stays unknown to the AI and is later used to evaluate how well it generalizes to unseen data. This is repeated for each of the five folds, so every part serves as the held-out set exactly once. Finally, the training is run on all available data to produce the deployable production model.
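The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cognitive Workbench implementation: the majority-label "training" is a toy stand-in for a real training run, the function names are hypothetical, and the folds are contiguous slices (a real setup would typically shuffle the data first).

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def train_majority(samples, labels):
    """Toy stand-in for a real training run: remember the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(model, samples, labels):
    """Fraction of held-out labels that the toy model predicts correctly."""
    return sum(1 for y in labels if y == model) / len(labels)


def cross_validate(samples, labels, k=5):
    """Score each fold on data unseen during training, then train on everything."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = train_majority([samples[i] for i in train], [labels[i] for i in train])
        scores.append(accuracy(model, [samples[i] for i in test], [labels[i] for i in test]))
    final_model = train_majority(samples, labels)  # the deployable model sees all data
    return final_model, scores


# Hypothetical toy data: ten documents with a document-type label each.
docs = [f"doc{i}" for i in range(10)]
tags = ["invoice", "claim", "invoice", "invoice", "claim",
        "invoice", "invoice", "claim", "invoice", "invoice"]
final_model, fold_scores = cross_validate(docs, tags)
print(final_model, fold_scores)
```

The five fold scores describe how well the approach generalizes; the final model, trained on all data, is the one that would be deployed.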
Each training results in a trained model. This is a resource that can be packaged with all the necessary run-time components and exported as a Docker image, which can then be installed anywhere and used on any amount of data (e.g. via a REST API). The trained model also contains a lot of additional information gathered during training. This includes the scores achieved during cross-validation, information about the training data, who performed the training, etc.
This makes it possible to inspect any trained model, compare it with other trainings on the same data, or decide to export and deploy it for production use.
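As a rough illustration of how such metadata enables comparison, the sketch below bundles scores and provenance into one record and picks the best training for a given data set. The `TrainingRecord` class, its fields, and all values are hypothetical and do not reflect the actual format used by the Cognitive Workbench.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TrainingRecord:
    """Hypothetical metadata bundle stored alongside a trained model."""
    model_id: str
    dataset_id: str
    trained_by: str
    fold_scores: list = field(default_factory=list)

    @property
    def mean_score(self):
        """Average of the per-fold cross-validation scores."""
        return mean(self.fold_scores)


def best_for_dataset(records, dataset_id):
    """Return the record with the highest mean score on one data set."""
    candidates = [r for r in records if r.dataset_id == dataset_id]
    if not candidates:
        raise ValueError(f"no training recorded for {dataset_id!r}")
    return max(candidates, key=lambda r: r.mean_score)


# Made-up example records; only trainings on the same data are compared.
records = [
    TrainingRecord("m1", "claims-v1", "alice", [0.80, 0.82, 0.79, 0.81, 0.83]),
    TrainingRecord("m2", "claims-v1", "bob", [0.85, 0.84, 0.86, 0.83, 0.87]),
    TrainingRecord("m3", "surveys-v1", "alice", [0.90, 0.91, 0.89, 0.92, 0.90]),
]
winner = best_for_dataset(records, "claims-v1")
print(winner.model_id, round(winner.mean_score, 3))
```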
Unsupervised training on large data sets
Unsupervised training may run against a full web crawl consisting of hundreds of millions of sentences. This produces a language model: something that encodes the general usage of words, sentences, etc. of the language in question. This language model is independent of what it is going to be used for later on. It encodes a lot of linguistic knowledge, such as word similarity (“house” and “building” are semantically similar), syntax of words (“running” and “tidying” have a similar sentence function), ambiguity, etc. The unsupervised training process derives all of this information by itself from the data. Imagine seeing many sentences with contexts like “the house was built in” or “the building was built in”. Even without any knowledge of English, anyone would be able to deduce that “house” and “building” appear to be interchangeable.
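The “house”/“building” intuition can be illustrated with a simple context-counting sketch. It describes each word by the words that occur around it and compares those descriptions with cosine similarity. This is a deliberately small stand-in for a real language model; the corpus, the window size, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Describe every word by counts of (relative position, neighbour) pairs."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][(j - i, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Tiny illustrative corpus; a real run would use millions of sentences.
corpus = [
    "the house was built in 1900",
    "the building was built in 1950",
    "she was running in the park",
    "he was tidying in the kitchen",
]
vecs = context_vectors(corpus)
print(cosine(vecs["house"], vecs["building"]))   # identical contexts
print(cosine(vecs["running"], vecs["tidying"]))  # mostly shared contexts
print(cosine(vecs["house"], vecs["running"]))    # no shared contexts
```

No labels or dictionaries are involved: the similarity emerges purely from how the words are used.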
Unsupervised learning has evolved a lot in the past 10 years. It went from complicated methods based on co-occurrence counting, with separate models for semantic and syntactic information, to the current state of the art: neural-network-based, fully contextualized, character-level language models that assign an individual vector to each word.
This vector has many properties. It is similar (but not necessarily equal) to the vector of the same word in a slightly different context, and it differs for the same word in a completely different context. The character-based approach makes it possible to assign good vectors even to words that were not seen in the unsupervised training (so-called out-of-vocabulary words). It also means that words containing OCR errors will get sensible vectors most of the time. These modern methods are simpler by design, but much more demanding in terms of required compute power.
As of 2019, training a good language model requires at least a small supercomputer with four Tesla GPUs and takes a couple of weeks.
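Why character-level information helps with unseen or OCR-damaged words can be shown with a much simpler technique than a neural language model: comparing words by their character trigrams. The sketch below is not the contextualized model described above, only an illustration of the underlying idea; the example words and function names are assumptions.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Count character n-grams of a word, with < and > marking its boundaries."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


clean = char_ngrams("insurance")
ocr_damaged = char_ngrams("insuranoe")  # hypothetical OCR error: "c" read as "o"
unrelated = char_ngrams("building")

print(cosine(clean, ocr_damaged))  # high: most trigrams are shared
print(cosine(clean, unrelated))    # zero: no trigram in common
```

Because the damaged word still shares most of its character sequences with the correct one, a character-based representation places the two close together even if the damaged form was never seen in training.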
Supervised training is the task of finding a model that reproduces a given set of training examples and generalizes to a separate set of test data that was not seen during training. An additional validation data set is often used in between for tuning the hyperparameters of the supervised training.
Traditionally, algorithms like Support Vector Machines (SVM) or Conditional Random Fields (CRF) were used in this area. Recently, though, supervised training is usually implemented with neural networks, using components like convolutional neural networks (CNN) or Long Short-Term Memory (LSTM) networks.
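The roles of the three data sets can be sketched with a tiny k-nearest-neighbour classifier, where the number of neighbours k is the hyperparameter. The data, the candidate values, and the function names are made up for illustration; the classifier is a generic textbook method, not one named in this text.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, data, k):
    """Share of points in data that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Toy one-dimensional data as (feature, label) pairs.
train_set = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.35, "b"),
             (0.7, "b"), (0.8, "b"), (0.9, "b")]
validation_set = [(0.15, "a"), (0.32, "a"), (0.75, "b"), (0.85, "b")]
test_set = [(0.25, "a"), (0.6, "b"), (0.95, "b")]

# The hyperparameter is chosen on validation data only ...
candidate_ks = [1, 3, 5]
best_k = max(candidate_ks, key=lambda k: accuracy(train_set, validation_set, k))

# ... and the test data is touched once, to report the final score.
test_accuracy = accuracy(train_set, test_set, best_k)
print(best_k, test_accuracy)
```

Keeping the test set out of the tuning loop is what makes the final score an honest estimate of generalization.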
ExB developed its own architecture that separates feature generation (designed or pre-learned) from the actual machine learning, so that the machine learning algorithm can easily be exchanged. At the same time, feature generation can easily be extended without negatively affecting the machine learning layer.
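The general idea of such a separation can be sketched as follows. This is a hypothetical illustration, not ExB's architecture: the class names, the two toy features, and the two toy learners are all assumptions. The point is only that the learner sees nothing but feature vectors, so either side can be swapped independently.

```python
from collections import Counter


def word_count(text):
    """Designed feature: number of whitespace-separated tokens."""
    return float(len(text.split()))


def digit_ratio(text):
    """Designed feature: share of characters that are digits."""
    return sum(ch.isdigit() for ch in text) / max(len(text), 1)


class MajorityLearner:
    """Trivial learner: always predicts the most frequent training label."""

    def fit(self, vectors, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, vector):
        return self.label


class NearestCentroidLearner:
    """Predicts the label whose mean feature vector is closest."""

    def fit(self, vectors, labels):
        grouped = {}
        for vector, label in zip(vectors, labels):
            grouped.setdefault(label, []).append(vector)
        self.centroids = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, vector):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(vector, centroid))
        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


class Pipeline:
    """Glue layer: turns texts into feature vectors and hands them to any learner."""

    def __init__(self, feature_functions, learner):
        self.feature_functions = feature_functions
        self.learner = learner

    def featurize(self, text):
        return [feature(text) for feature in self.feature_functions]

    def fit(self, texts, labels):
        self.learner.fit([self.featurize(text) for text in texts], labels)
        return self

    def predict(self, text):
        return self.learner.predict(self.featurize(text))


texts = ["Invoice 2019 total 1500 EUR", "Dear team, thank you for your kind help",
         "Bill 4711 amount 230", "I would like to complain about the service"]
labels = ["bill", "letter", "bill", "letter"]
features = [word_count, digit_ratio]

# Same features, two interchangeable learners.
centroid_pipeline = Pipeline(features, NearestCentroidLearner()).fit(texts, labels)
majority_pipeline = Pipeline(features, MajorityLearner()).fit(texts, labels)
print(centroid_pipeline.predict("Receipt 998 sum 45"))
```

Adding a feature means appending a function to the `features` list; exchanging the learner means passing a different object with `fit` and `predict`.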
Best score in lesion segmentation and melanoma detection (ISIC Challenge).
Best company at the Camelyon Challenge (“Cancer Metastasis Detection in Lymph Nodes”).
Best company for identification of cancer cells on histological images (GlaS Challenge).
Second place for English in semantic text similarity, and first place (by a large margin) for Spanish with a model that was also able to learn from the English data.
Best company in multi-document text summarization in 38 languages, overall third place.
Second-best company in “Medical Text Mining” for drugs, diseases and their interactions.
First and second overall place, best “Named Entity Recogniser” for German.
We take time for your questions.
And we would be happy to show you a product demo.
The Cognitive Workbench in insurance:
Process business mail more reliably.
To date, less than half of incoming business mail at insurance companies can be processed automatically. With the text mining of the Cognitive Workbench, you can initiate a trend reversal. Thanks to a significantly higher recognition quality – e.g. for damage reports, certificates of incapacity for work, medical letters and customer letters – you can relieve your clerks and reduce your operational costs.