Automatic language identification of short texts

Written by A. Avenberg

Paper category

Master Thesis


Computer Science




Thesis:

Natural language processing

Natural language processing (NLP) is a field of computer science and linguistics concerned with how to describe, represent, use, and construct languages in a computational manner. NLP, also known as computational linguistics, has been around since the 1980s [3] and includes several different subfields. Examples of NLP topics are natural language modeling, information extraction, text recognition, text translation, question answering, and summarization [3]. With the rapid growth of computing power and parallelization in recent years, machine learning and deep learning can now be applied to NLP [3]. Examples of machine learning problems in NLP are speech recognition, machine translation, sentiment analysis, and automatic language identification.

2.1.1 Automatic Language Identification

Automatic language identification (LID) aims to recognize the language of an input without human intervention [4]. LID is part of many web services today. Many websites apply LID to the text entered in the search bar so that the most relevant results are shown first. Another example is a translation tool that automatically recognizes the language of the written text and then translates it into the desired language. In machine translation, the automatic translation of documents, a well-functioning LID model is the first step before the translation itself. Language identification is therefore a key part of natural language processing. In general, artificial intelligence and machine learning try to imitate the way the human brain works, which is explained further in the following section [5]. Several different methods for automatic language identification have been studied since the 1960s [4], but, as with artificial intelligence, performance has improved rapidly in the past 10 years. LID can be applied to speech, sign language, or text.
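To make the task concrete, the following is a minimal sketch of text-based language identification, assuming a simple character-trigram overlap score. It is an illustration only and not the method developed in the thesis; the function names, the toy training sentences, and the scoring rule are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(samples, n=3):
    """Count how often each character n-gram occurs in the training samples."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts


def identify(text, profiles, n=3):
    """Return the language whose profile best matches the n-grams of text."""
    grams = char_ngrams(text, n)

    def score(profile):
        # Sum of relative n-gram frequencies; n-grams unseen in training count as 0.
        total = sum(profile.values())
        return sum(profile[g] / total for g in grams)

    return max(profiles, key=lambda label: score(profiles[label]))


# Hypothetical toy training data; a real model would need far more text.
TRAINING = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short text written in english",
    ],
    "swedish": [
        "den snabba bruna räven hoppar över den lata hunden",
        "det här är en kort text skriven på svenska",
    ],
}
PROFILES = {label: build_profile(samples) for label, samples in TRAINING.items()}

print(identify("the dog is lazy", PROFILES))
print(identify("hunden är lat", PROFILES))
```

Character n-grams are used here because even a very short text contains several of them, which is one common reason they are chosen for short inputs. With so little training data the sketch is fragile and only meant to show the shape of the problem: text in, language label out.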
2.2 Artificial Intelligence

Artificial intelligence (AI) is a part of computer science and has become one of the most prominent and advanced topics in today's technology. AI has been around much longer than is often assumed: it was introduced as early as the 1950s. The early ideas of AI still exist today, namely the intelligence of machines and attempts to imitate the intelligence of humans and animals. Today, artificial intelligence is used in many different applications and can cover many different things. However, it has several definitions and is used as a broad term in many fields. The main idea of artificial intelligence is to allow machines to complete tasks that usually require human intelligence.

2.3 Machine learning

As with artificial intelligence, there is no single clear definition of machine learning (ML). In 1959, Arthur Samuel proposed a common definition: "Machine learning enables computers to learn without explicit programming." Machine learning is a form of artificial intelligence, or can be defined as a subfield of it. The general idea of ML is to learn or recognize patterns in existing data sets. This yields a model that can later be given new data and produce results for the task based on the data it has previously seen. ML can thus be described as generalizing a model from a set of task-specific data so that it can be used in the future on another set of similar data for the same task. Various models and algorithms can be used for this.

There are two distinct types of problem in ML: regression problems and classification problems. A regression problem requires a numerical output, such as the price of a house or a temperature; the data is quantitative.
A classification problem, on the other hand, requires a specific label as output, such as true or false, or one of several categories, such as a specific language; the data is qualitative [9].

The mathematical algorithms behind many of the most commonly used machine learning models have existed for more than 50 years, and the underlying probability theory rests on much older results, such as Bayes' theorem [10].

There are three main methods in machine learning: supervised learning, unsupervised learning, and reinforcement learning [11].

Supervised learning is the most widely studied and used type of machine learning. In supervised learning, the data set used to train the model has a label for each data sample. For example, if the data set consists of animal pictures, each picture also comes with the correct label of the animal in it, such as "dog" or "cat". Both regression and classification problems can be solved through supervised learning. Supervised learning requires labeled data sets for both training and testing.

In unsupervised learning and reinforcement learning, the data set does not have a label for each sample. This means that the learning task is different from supervised learning, and neither regression nor classification problems can be solved in this setting. Unsupervised learning has only the data samples as input and knows nothing else. The model can learn patterns in the data set and cluster similar data together. Unsupervised learning can be used as a preliminary step before a supervised model, or to understand how the samples in a data set are related to each other.

In reinforcement learning, the data set also lacks a label for each sample, but the data contains other information about how different outputs are rated for the task under investigation.
Generally, the input of a model includes the described action, several different outputs of the action, and the rating of each output [11]. Reinforcement learning is well suited to training models to play games, such as chess, or to simulate animal behavior.
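The difference between supervised and unsupervised learning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method from the thesis: the animal weights are made-up data, and the helper names (fit_centroids, predict, two_means) are hypothetical. The supervised part uses a nearest-centroid classifier, and the unsupervised part uses a simple two-cluster k-means on one-dimensional data.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def fit_centroids(samples, labels):
    """Supervised: average the samples of each label into one centroid."""
    grouped = {}
    for x, y in zip(samples, labels):
        grouped.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in grouped.items()}


def predict(centroids, x):
    """Classify x with the label of the nearest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(samples, iterations=10):
    """Unsupervised: split unlabeled 1-D samples into two clusters."""
    lo, hi = min(samples), max(samples)
    left, right = list(samples), []
    for _ in range(iterations):
        left = [x for x in samples if abs(x - lo) <= abs(x - hi)]
        right = [x for x in samples if abs(x - lo) > abs(x - hi)]
        if not right:  # all samples identical: nothing to split
            break
        lo, hi = mean(left), mean(right)
    return sorted(left), sorted(right)


# Hypothetical data: animal weights in kilograms.
weights = [3.5, 4.0, 4.5, 20.0, 25.0, 30.0]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

# Supervised: the labels are available during training.
centroids = fit_centroids(weights, labels)
print(predict(centroids, 5.0))
print(predict(centroids, 22.0))

# Unsupervised: only the samples are given; the clusters have no names.
print(two_means(weights))
```

The supervised model can name its output ("cat" or "dog") because it was trained on labels. The unsupervised model finds the same two groups but cannot say what they are, which matches the description of clustering similar data together.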