CROSS-LANGUAGE TEXT CLASSIFICATION WITH CONVOLUTIONAL NEURAL NETWORKS FROM SCRATCH

Cross-language classification is an important task in multilingual learning, where documents in different languages often share the same set of categories. The main goal is to reduce the labeling cost of training a classification model for each individual language. This article proposes a novel approach that uses Convolutional Neural Networks for multilingual text classification. It learns a representation of the knowledge gained from the languages. Moreover, the method works for a new individual language that was not used in training. The results of an empirical study on a large dataset of 21 languages demonstrate the robustness and competitiveness of the presented approach.


Introduction
Text classification, or text categorization, is currently one of the most studied problems in the field of information and computer sciences. The task is to assign a text to one or more classes or categories. Classification is a popular technique and has been applied to spam filtering, email routing, sentiment analysis, etc. This may be done "manually" or algorithmically. Machine learning is an outstanding approach to automated text classification [1,2] and categorization [3]. Moreover, text clustering, or unsupervised text classification, has been used to enhance information retrieval. This is based on the clustering hypothesis, which states that texts with similar contents are also relevant to the same query [4]. A fixed collection of texts is clustered into groups, or clusters, with similar contents. The similarity between texts is usually measured with associative coefficients from the vector space model, e. g. the cosine coefficient.
There are numerous automatic text classification techniques, including:
− Expectation maximization (EM);
− Naive Bayes classifier;
− Tf-idf;
− Latent Semantic Indexing;
− Latent Dirichlet Allocation;
− Support Vector Machines (SVM);
− K-nearest neighbor algorithms;
− Decision Trees such as ID3 or C4.5.
Besides these techniques, there are also neural network and deep learning approaches, which were responsible for the recent boost in machine learning [5]. This work focuses on Convolutional Neural Networks (CNNs). It is commonly assumed that CNNs are only used in Computer Vision [6], and this is partially true: convolutional neural networks were major breakthroughs in image classification tasks and are now the most successful approach in most computer vision systems.
CNNs are made up of neurons that have learnable weights and biases, as in ordinary neural networks. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores on the other. Moreover, CNNs still have a loss function (e. g. SVM/softmax) on the last fully-connected layer, and all the tips and tricks developed for training regular neural networks still apply.
Recently, CNNs have been applied to problems in Natural Language Processing (NLP), and the obtained results are promising and interesting for future research.
Most machine learning approaches to text understanding and classification begin by tokenizing a string of characters into structures such as words, phrases, sentences, and paragraphs. Usually, some statistical classification algorithm is then applied to the statistics of these structures [7]. These techniques work well enough for a narrowly defined domain. However, prior knowledge is required: a pre-defined dictionary of words of interest, a parser to handle variations such as morphological changes of words, and specialization to a particular language.
With the advancement of deep learning and the availability of large datasets, methods for handling text understanding with deep learning techniques have gradually become available. One technique that draws great interest is word2vec [8]. Inspired by traditional language models, this technique constructs a representation of words as vectors of fixed length, trained on a large corpus. Numerous techniques apply these representations, or similar approaches with an engineered language model, and achieve good results on various tasks [2]. Moreover, the neural network presented here performs multilingual classification. The term "multilingual" came from information retrieval, and it also has a specific meaning in cross-language information retrieval, where a text collection is multilingual. This article shows that cross-language multi-label text classification can be handled by a deep learning system without artificially embedding knowledge about words, phrases, sentences or any other syntactic or semantic structures associated with a language. In the following sections, related work is reviewed first, and then the multilingual dataset is introduced. The problem of cross-language text classification was first explicitly considered in the cross-lingual text categorization paper [9], which is predated by work in cross-language information retrieval (CLIR), where similar problems are addressed [10].
Traditional approaches to cross-language text classification and/or CLIR rely on linguistic resources such as bilingual dictionaries or parallel corpora to induce correspondences between two languages [11,12]; for example, the method of [13] performs latent semantic analysis on a parallel corpus and is considered seminal work in CLIR. Later it was improved by Li and Taylor [14], who employed kernel canonical correlation analysis instead of latent semantic analysis. The limitations of these approaches are the dependence on a parallel corpus and the computational complexity. The use of automatic machine translation was studied in cross-language text classification [12,15]. These methods typically translate the text into the source or target language.
Usually, the second step is to reduce the noise introduced by the machine translation using dimensionality reduction [16] or semi-supervised learning [15]. Obviously, the performance of such classifiers depends on the quality of the automatic machine translation. This paper introduces an approach to cross-language multi-label text classification using a deep convolutional neural network. Our extensive empirical study on a large dataset of multilingual texts suggests that the proposed solution has multiple benefits. First of all, it does not require parallel data between all languages or bilingual dictionaries, which is often a bottleneck, because in many real-world scenarios such parallel data may not be available. Secondly, there is no need to artificially embed knowledge about words, phrases, sentences or any other syntactic or semantic structures associated with a language. Finally, the representation of text for any language is learned by the neural network and could be used in other applications.

Materials and Methods
In order to train and test a CNN for cross-language multi-label classification, proper data is required. The dataset was created from text extracted from Wikipedia articles; for this purpose, a crawler that iterates over DBpedia (a public dataset of Wikipedia) was implemented. The dataset consists of article URLs, categories and their confidence. The texts of the articles were collected by crawling the corresponding URLs. A dataset of 23 languages with multiple categories defined per article (Table 1) was constructed.
The numbers of articles in Portuguese and Swedish are bigger than, for example, in Bulgarian. Besides, every category has a different number of articles and is assigned differently across languages. Moreover, every article has a different size: the smallest may contain two sentences, the largest approximately ten pages of text. Therefore, in order to have a balanced dataset, the length of each article was fixed at 1,500 words: longer texts were simply cut, and shorter ones were padded by appending a special token. The number of articles per category varies from 150 to 400 per language, and accordingly the total number of articles per language after pre-processing is in the range 39,000−78,000. Exceptional categories that are assigned to fewer than 150 articles across the whole dataset were filtered out (Fig. 1). In the end, the dataset contains almost 450,000 articles and more than 300 categories. All the results in this article are based on the collected data. The dataset does not come with an official train/test split, so it was simply divided into train, validation and test sets of 70 %, 15 % and 15 % respectively.
The data pre-processing steps are (a code sketch of these steps is given below):
− load sentences from the raw data files;
− clean the text data using regular expressions in order to match only words;
− append a special token to each article text shorter than 1,500 words; this step is required because each example in a batch must be of the same length;
− build a vocabulary index and map each word to an integer between 0 and the vocabulary size, so that each article text becomes a vector of integers;
− build a category index and map each category to an integer between 0 and the number of categories, so that for each article text a vector of category integers is obtained.
The last two steps represent the text as a list of numbers at the word level. However, the vocabulary grows with every additional language and consists of all words from the training set. To handle this, the word embedding technique is used; it was originally introduced by Bengio more than a decade ago [17]. A word embedding $W: \text{words} \to \mathbb{R}^n$ is a parameterized function mapping words from the vocabulary to high-dimensional vectors. The embedding layer is the first stage in this model: the input is a list of numbers (each number is the index of a word in the vocabulary), and the output is a 512-dimensional vector per word. The mapping function is learned during training in order to obtain vectors that are meaningful for the multilingual classification task. Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the text of an article. Then an article is the concatenation $x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n$, where $n$ is the number of words in the article.
Applying convolutional networks to text classification, and to natural language processing at large, has been explored in the literature. It has been shown that CNNs can be directly applied to distributed [18,19] or discrete [20] embeddings of words, without any knowledge of the syntactic or semantic structures of a language. These approaches have proven competitive with traditional models. This article treats the text as a kind of raw signal at the word level and applies a temporal (one-dimensional) CNN to it in order to assign categories. An example is presented in Fig. 2.
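As an illustration, the following is a minimal sketch of the pre-processing steps listed above; the regular expression, the padding token name and the function names are our own assumptions, not taken from the original implementation.

```python
import re
from collections import Counter

PAD = "<pad>"    # assumed name for the special padding token
MAX_LEN = 1500   # fixed article length from the text

def clean(text):
    # Keep only word tokens, mirroring the regular-expression cleaning step.
    return re.findall(r"[^\W\d_]+", text.lower(), flags=re.UNICODE)

def build_vocab(tokenized_articles):
    # Map each word seen in the training set to an integer index; 0 is padding.
    counts = Counter(t for tokens in tokenized_articles for t in tokens)
    vocab = {PAD: 0}
    for word in counts:
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Cut longer texts at MAX_LEN, pad shorter ones with the special token.
    ids = [vocab[t] for t in tokens[:MAX_LEN]]
    ids += [vocab[PAD]] * (MAX_LEN - len(ids))
    return ids
```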

Fig. 2. Model architecture with two channels for an example sentence [19]
The model architecture is a deeper, extended variant of Kim Yoon's CNN for sentence classification [19], inspired by Collobert [21], Kalchbrenner [22] and Alexis Conneau [23]. The convolution operation involves a filter $w \in \mathbb{R}^{hk}$, applied to a window of $h$ words to produce a feature map $c \in \mathbb{R}^{n-h+1}$. In the next step, a max-pooling operation [21] is applied over the feature map: the maximum value $\hat{c} = \max\{c\}$ is chosen as the characteristic of a particular filter. The main purpose is to capture the most important characteristic of the whole feature map. This pooling scheme is common in computer vision and also naturally deals with variable article lengths. The model performs convolution operations over the vectors $x_{1:n}$ of the article representation. The data is passed from the embedding layer to the first convolutional layer, defined with filters of window size 32, and then to the second, with window size 16. The third layer is a combination of three convolutional layers with window sizes 3, 4 and 5; each of them takes the output of the second layer as input and produces a separate outcome, as shown in Fig. 3. The outcomes are concatenated and fed to the feedforward layer.
Moreover, it is important to mention that for all convolutional operations the stride size is 1 (i. e. the shifting length of the filter at each step) and wide convolution, or zero-padding, is used (i. e. zero elements are added at the matrix edges for elements that do not have all their neighbors). ReLU was chosen as the activation function because of its speed. The next component is a fully connected layer, in which each input neuron is connected to each output neuron of the next layer; its output is the probability distribution over categories (Fig. 4).
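To make the architecture concrete, here is a minimal sketch in TensorFlow/Keras. The vocabulary size, the number of filters per layer and the width of the feedforward layer are assumptions; the text fixes only the embedding size, the window sizes, the stride and the zero-padding. The model ends at the feedforward representation because, as described below, the final category scores are produced by the output layer inside the NCE loss.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 100_000  # assumption: grows with the number of training languages
SEQ_LEN = 1500        # fixed article length
EMB_DIM = 512         # embedding size from the text
NUM_FILTERS = 128     # assumption: filters per convolutional layer
HIDDEN = 384          # assumption: width of the feedforward layer

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
# First two temporal convolutions: window sizes 32 and 16,
# stride 1 and zero padding ("same"), ReLU activations.
x = layers.Conv1D(NUM_FILTERS, 32, strides=1, padding="same", activation="relu")(x)
x = layers.Conv1D(NUM_FILTERS, 16, strides=1, padding="same", activation="relu")(x)
# Third layer: three parallel convolutions with window sizes 3, 4 and 5,
# each followed by max-over-time pooling; the outcomes are concatenated.
branches = []
for window in (3, 4, 5):
    b = layers.Conv1D(NUM_FILTERS, window, strides=1, padding="same",
                      activation="relu")(x)
    b = layers.GlobalMaxPooling1D()(b)
    branches.append(b)
x = layers.Concatenate()(branches)
# Feedforward layer; category scores come from the NCE output layer below.
hidden = layers.Dense(HIDDEN, activation="relu")(x)
model = tf.keras.Model(inputs, hidden)
```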
In the "standard" multinomial classification problem the category prediction requires to have as many output neurons as categories.

A neural network is usually trained with a cross-entropy cost function, which requires the values of the output neurons to represent probabilities; this means that the output "scores" computed by the network for each category have to be normalized, i. e. converted into actual probabilities. This normalization step is achieved by means of the softmax function, which is computationally expensive for a large output layer. That is why Noise Contrastive Estimation (NCE) [24] was chosen as the loss function. Each training example $(x_i, T_i)$ consists of a context $x_i$ (in our case, a text) and a small set of target categories $T_i$. Let $P(y \mid x)$ be shorthand for the expected count of a category $y$ in the set of target categories for a context $x$; for sets with no duplicates, this is the probability of the category given the context. The main purpose is to train a function $F(x, y)$ to approximate the log probability of the category given the context. As a result, the training task is to distinguish the true candidates from sampled candidates: there is one positive training meta-example for each element of $T_i$ and one negative training meta-example for each element of a sampled set $S_i$. Let $Q(y \mid x)$ denote the expected count, according to our sampling algorithm, of a particular category in the set of sampled categories.
When $S$ never contains duplicates,

$$\mathrm{logodds}(y \text{ came from } T \text{ vs } S \mid x) = \log \frac{P(y \mid x)}{Q(y \mid x)} = \log P(y \mid x) - \log Q(y \mid x).$$

The first term, $\log P(y \mid x)$, is what the function $F(x, y)$ is trained to estimate; in the model there is a layer which represents $F(x, y)$. The second term, $\log Q(y \mid x)$, is computed analytically and added to it, and the result is passed to a logistic-regression loss function, whose "label" indicates whether $y$ came from $T$ as opposed to $S$.
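TensorFlow ships a built-in implementation of this objective, tf.nn.nce_loss, which draws the sampled set $S$ and assembles the logistic-regression loss internally. A hedged sketch follows; the hidden size and the sampling parameters are assumptions, not values from the text.

```python
import tensorflow as tf

NUM_CATEGORIES = 300  # "more than 300 categories" in the dataset
HIDDEN = 384          # assumption: width of the layer representing F(x, y)
NUM_TRUE = 5          # assumption: target categories per article in a batch
NUM_SAMPLED = 64      # assumption: negative categories sampled per example

nce_weights = tf.Variable(
    tf.random.truncated_normal([NUM_CATEGORIES, HIDDEN], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([NUM_CATEGORIES]))

def nce_objective(hidden, true_categories):
    # hidden: [batch, HIDDEN] activations; true_categories: int64 [batch, NUM_TRUE]
    loss = tf.nn.nce_loss(
        weights=nce_weights, biases=nce_biases,
        labels=true_categories, inputs=hidden,
        num_sampled=NUM_SAMPLED, num_classes=NUM_CATEGORIES,
        num_true=NUM_TRUE)
    return tf.reduce_mean(loss)
```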

$F(x, y)$ is trained via the backpropagation signal. The dropout technique was employed for regularization, with a constraint on the l2-norms of the weight vectors [25]. The idea is to randomly disable neurons, which helps to prevent the co-adaptation of hidden units. Let $z = [\hat{c}_1, \ldots, \hat{c}_m]$ be the (pooled) convolutional layer, where $m$ corresponds to the number of filters. To compute the output unit, dropout uses a "masking" vector $r$. Therefore, in forward propagation, the output unit is computed as $y = w \cdot z' + b$, where $z' = z \odot r$ is the element-wise multiplication of the convolutional layer with the masking vector. Consequently, gradients are backpropagated only through the unmasked units, and dropout is used only during the training stage. Additionally, after a gradient-descent step, the l2-norm constraint is applied simply by rescaling the weight vectors: whenever $\|w\|_2 > s$, set $\|w\|_2 = s$ [19]. Training is done through stochastic gradient descent over shuffled mini-batches with the Adam update rule [1].
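A minimal sketch of the two regularizers follows. The cap $s$ and the dropout rate are assumptions (the text does not state their values), and tf.nn.dropout implements the scaled, "inverted" variant of the masking described above.

```python
import tensorflow as tf

S = 3.0          # assumption: the l2-norm cap s
DROP_RATE = 0.5  # assumption: probability of disabling a unit

def dropout_output(z, w, b, training=True):
    # y = w . z' + b, where z' = z (element-wise) r during training;
    # tf.nn.dropout also rescales the kept units by 1 / (1 - DROP_RATE).
    z_prime = tf.nn.dropout(z, rate=DROP_RATE) if training else z
    return tf.matmul(z_prime, w) + b

def apply_l2_constraint(w, s=S):
    # After a gradient step, rescale each column so that ||w||_2 = s
    # whenever ||w||_2 > s; columns already inside the ball are untouched.
    norms = tf.norm(w, axis=0, keepdims=True)
    return w * tf.minimum(1.0, s / tf.maximum(norms, 1e-12))
```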

Experimental procedures
A convolutional neural network model has many hyperparameters available for tuning. In order to build our CNN model, an empirical evaluation of the hyperparameters was performed, and the effect of the hyperparameters on accuracy and performance was analyzed. The model was trained with the hyperparameters described in the previous section. All the experiments were run on an Amazon "p2.xlarge" instance with the following hardware characteristics: an NVIDIA K80 GPU with 2,496 parallel processing cores and 12 GiB of GPU memory, 16 GiB of RAM and an Intel Xeon E5-2686 v4 CPU.
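Putting the pieces together, one training step with the Adam update rule might look as follows. This sketch reuses model, nce_objective, nce_weights and nce_biases from the earlier sketches, and the learning rate is an assumption.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # assumed learning rate

@tf.function
def train_step(batch_texts, batch_categories):
    # batch_texts: int32 [batch, 1500]; batch_categories: int64 [batch, NUM_TRUE]
    trainables = model.trainable_variables + [nce_weights, nce_biases]
    with tf.GradientTape() as tape:
        hidden = model(batch_texts, training=True)  # dropout active in training
        loss = nce_objective(hidden, batch_categories)
    grads = tape.gradient(loss, trainables)
    optimizer.apply_gradients(zip(grads, trainables))
    return loss
```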

Results
Since the correct categories were collected together with the articles in the validation set, the accuracy is calculated simply by comparing the predicted categories with the correct ones. Accuracy for each example (text, categories) is defined as the ratio of correctly predicted categories to the total number of predicted and actual categories for that text.
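A sketch of this per-example accuracy follows; interpreting "the total number of predicted and actual categories" as the size of the union of the two category sets is our assumption.

```python
def example_accuracy(predicted, actual):
    # Correctly predicted categories over the total number of distinct
    # predicted and actual categories for one article.
    predicted, actual = set(predicted), set(actual)
    total = len(predicted | actual)
    return len(predicted & actual) / total if total else 1.0

# e.g. example_accuracy(["Sport", "History"], ["Sport"]) -> 0.5
```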
The accuracy on the validation set is 84.56±2.6 %. In order to validate our hypothesis about transfer learning, articles collected for the German language were used. The model described above achieves 63.34±1.98 % for texts written in a completely new language: German was not included in our training and test sets.

Discussion
First of all, according to Fig. 5, the values of the loss function for the train and test datasets (blue and red on the figure, respectively) start to diverge after 8 thousand samples. This is a strong sign of overfitting, and we think that increasing the number of words per article would result in better model generalization. It was also observed that max-pooling always beats average pooling, while the best filter sizes depend on the concrete task. However, it was found that the L2-norm constraint on the weight vectors has little effect on the end result; a similar conclusion was made by Y. Zhang [1].

The proposed model has two strong sides: CNNs can be used from scratch without a feature extractor, and they do not require knowledge of the language or of its syntactic or semantic structures; they perform well on new languages and as a result can be used for transfer learning. Nevertheless, there is still a lot of room for research on the model architecture. The accuracy for texts in German shows that the model effectively applies patterns and structural information learned from other languages. It is assumed that the approximately 63 % accuracy is achieved because the model was trained on languages from the same group (i. e. Danish, Norwegian and Swedish).

Conclusions
Summing up, CNNs show good results on the cross-language classification task and are a good fit for natural language processing tasks. Despite the fact that convolutional neural networks do not have such a clear intuition of location invariance and compositionality as they do in image tasks, it turns out that CNNs perform quite well on NLP problems. The approach in this article works for multilingual long-form texts such as multi-language Wikipedia articles. Additionally, the classification of a new language shows promising results but requires further research. Intuitively, it makes sense that using pre-trained word embeddings would yield larger accuracy gains for our model, but we used only a one-hot word representation and learned the embedding during the training procedure.
A limitation of this research is that all articles have the same length; as a result, this method may not be a good choice for data that looks considerably different. However, a dynamic graph creation approach could be used as a solution.
The proposed solution could be used in a variety of applications, such as search engines, e-libraries, knowledge bases, other information retrieval systems, and software that works with multilingual documents.