Published on
March 28, 2023

What is BERT?
BERT, short for Bidirectional Encoder Representations from Transformers, describes how it works in its name. BERT is an open-source deep learning model introduced by Google for transfer learning in Natural Language Processing (NLP) tasks. With growing demand in SEO, BERT has taken on a significant role in Google Search, where it helps match search keywords to the most relevant links and their underlying text.
BERT represents text by encoding it into numeric vectors, and it uses only the encoder part of the Transformer to do so. It reads the input data and makes predictions based on what it has learned. Its distinctive strategy is to read the whole input sequence at once, rather than word by word, and convert it into an encoded array. BERT has proven to be a state-of-the-art model for context learning and a wide range of other NLP tasks, such as named entity recognition, question answering, sentiment analysis, and text classification and summarisation.
BERT gives the developer a built-in way to apply NLP and transform textual data. Classical machine learning algorithms do not include this: when a machine learning classifier is used for text classification, a separate NLP step is needed to turn the text into features before the classifier can be applied. With BERT, both steps are included in a single transformer model, so the entire classification pipeline can be carried out with it.

Introduction to NLP

To understand BERT thoroughly, one needs to understand NLP, Transformers, and how Transformers work in NLP. Computers have historically had difficulty “understanding” language. Although computers are capable of reading, storing, and collecting textual inputs, they lack basic language context. As a result, Natural Language Processing, a branch of artificial intelligence, was developed with the aim of making computers capable of reading, analysing, interpreting, and making sense of written and spoken language. To help computers “understand” human language, this practice combines linguistics, statistics, and machine learning.

Overview of Transformer in NLP

BERT is built on the Transformer architecture, a family of neural network architectures. The Transformer is based on the principle of self-attention and was first introduced in the paper “Attention Is All You Need”. Self-attention means learning to weigh the significance of each word in relation to the other words in the input sequence. Attention, a powerful deep-learning technique that is also used in computer vision models, is the key to how Transformers operate. An example helps to illustrate attention: can a person remember everything they saw on a given day? Certainly not! Our brains have limited but valuable memory, and our capacity to forget minor inputs aids our recall. In a similar vein, machine learning models need to learn to concentrate on the relevant information rather than spend computational resources on irrelevant data. Transformers compute attention weights that indicate which words in a sentence matter most when processing each word. The original Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. BERT, however, uses only the encoder.
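The weighting idea behind self-attention can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual implementation: the two-dimensional token vectors are hypothetical, and the token vectors are used directly as queries, keys, and values, whereas a real Transformer layer first applies learned projection matrices and uses several attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted average of all token vectors.
        outputs.append([sum(w * vec[j] for w, vec in zip(row, vectors)) for j in range(dim)])
    return outputs, weights

# Hypothetical 2-dimensional embeddings for three tokens (illustration only).
tokens = ["river", "bank", "money"]
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, and each row sums to 1.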

How does BERT work?

Training with BERT builds on one of several pre-trained BERT models. These models come in a range of sizes, from about 110 million parameters (“BERT-BASE”) to about 340 million parameters (“BERT-LARGE”). BERT-BASE has 12 encoder layers with a hidden size of 768, while BERT-LARGE has 24 encoder layers with a hidden size of 1024. Smaller pre-trained variants are also available, with the number of encoder layers ranging from 2 to 12 and the hidden size ranging from 128 to 768.
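The parameter counts quoted above can be roughly reproduced with some arithmetic. The sketch below makes several assumptions taken from the published BERT-Base and BERT-Large configurations rather than from this article: 12 layers with hidden size 768 and 24 layers with hidden size 1024 respectively, a vocabulary of 30,522 WordPiece tokens, 512 positions, two segment types, and a feed-forward size of four times the hidden size. It counts the encoder and pooler only, without any task-specific head.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, segment_types=2):
    """Approximate parameter count of a BERT encoder plus pooler (no task head)."""
    ffn = 4 * hidden          # assumed feed-forward size
    layer_norm = 2 * hidden   # scale and bias
    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), then LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-Base:  {base:,}")   # 109,482,240
print(f"BERT-Large: {large:,}")  # 335,141,888
```

Under these assumptions the totals come out at roughly 110 million and 335 million, close to the commonly quoted 110 million and 340 million.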

A series of tokens is the input to the BERT encoder. The tokens are first turned into vectors and then processed by the neural network. However, before processing can begin BERT requires the input to be manipulated and embellished with additional metadata such as:

  • Token Embedding: A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  • Segment Embedding: For each token, a marker indicating Sentence A or Sentence B is added. As a result, the encoder can discriminate between different sentences.
  • Positional Embedding: A positional embedding is added to each token to indicate its position in the sentence.
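The three kinds of input information listed above can be sketched for a sentence pair as follows. This is a minimal illustration with hypothetical, already-tokenised sentences; `build_bert_inputs` is an invented helper name, not a library function. In the real model, each token, segment id, and position id indexes a learned embedding table, and the three resulting vectors are summed.

```python
def build_bert_inputs(sentence_a, sentence_b=None):
    """Add [CLS]/[SEP] markers and derive segment and position ids."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)           # Sentence A
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
        segment_ids += [1] * (len(sentence_b) + 1)  # Sentence B and its [SEP]
    position_ids = list(range(len(tokens)))   # position of each token in the input
    return tokens, segment_ids, position_ids

# Hypothetical tokenised sentences, for illustration only.
tokens, segment_ids, position_ids = build_bert_inputs(
    ["i", "feel", "stressed"], ["work", "is", "busy"]
)
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```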

Before any of this can happen, the input text must first be tokenised by the BERT tokeniser. BERT pre-training then relies on the following two NLP tasks:

  • Masked Language Modelling (MLM): MLM uses the words that appear before and after a masked token to teach the model about the context of each word. Before a word sequence is fed into BERT, 15% of its tokens are selected for prediction; most of these (80%) are replaced with a [MASK] token, while the rest are either replaced with a random token or left unchanged. Based on the context provided by the other tokens in the sequence, the model then attempts to predict the original value of the selected tokens. Because BERT learns from the context on both the left and the right of a token during training, it is called bidirectional.
  • Next Sentence Prediction (NSP): While MLM trains the relationship between words, NSP aims to capture the longer-range relationship between sentences. During training, the model learns to predict whether the second sentence in a pair is the sentence that follows the first in the original document. In 50% of the training inputs the second sentence is indeed the subsequent sentence, while in the other 50% a random sentence from the corpus is chosen as the second sentence. To help the model distinguish between the two sentences, the input is processed as follows before it enters the model: a [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
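The two pre-training tasks above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the original pre-training code: `mask_tokens` and `make_nsp_pair` are hypothetical helper names, each token is selected independently with probability 15% (the original implementation picks a fixed share of positions), and the 80/10/10 split between [MASK], a random token, and the unchanged token follows the BERT paper.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels hold the original token at selected positions."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token                     # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for the sentence at `index`."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], True   # the true next sentence
    others = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(others), False         # a random other sentence

rng = random.Random(0)
tokens = ["[CLS]", "the", "deadline", "is", "tomorrow", "and", "i", "am", "worried", "[SEP]"]
print(mask_tokens(tokens, vocab=["calm", "tired", "today"], rng=rng))
print(make_nsp_pair(["s0", "s1", "s2", "s3"], index=1, rng=rng))
```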

BERT converts words into numbers, which is an important step because machine learning models take numbers rather than words as input. This allows you to train machine learning models on your textual data. In other words, BERT models are used to transform your text data so that it can be combined with other types of data in a machine learning model to make predictions.
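The first step of turning words into numbers is tokenisation. BERT uses WordPiece, which splits a word into the longest sub-word pieces found in its vocabulary and marks continuation pieces with `##`. The sketch below shows this greedy longest-match idea with a tiny, made-up vocabulary; the real vocabulary and the ids it assigns are different.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return [unk]                       # no piece matched: unknown word
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary mapping pieces to ids (hypothetical, for illustration only).
vocab = {"[UNK]": 0, "stress": 1, "##ed": 2, "##ful": 3, "relax": 4, "##ing": 5}
for word in ["stressed", "relaxing", "calm"]:
    pieces = wordpiece_tokenize(word, vocab)
    print(word, pieces, [vocab[p] for p in pieces])
```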

What makes BERT different?

Unlike other large language models such as GPT-3, BERT’s source code is publicly accessible (it can be viewed on GitHub), which allows BERT to be used more widely around the world. Developers can get a cutting-edge model like BERT up and running quickly without spending a lot of time or money, and can instead focus on fine-tuning BERT to tailor its performance to their specific tasks. If one does not want to fine-tune BERT, thousands of free, open-source, pre-trained BERT models are available for particular use cases. Because BERT is pre-trained, fine-tuning brings several advantages: it needs much less data, the relevant layers to tune can be chosen, and it supports transfer learning. Fine-tuned models can be used immediately. Pre-trained BERT models are also available in more than 100 languages, which is useful for projects that are not English-based.

Implementation of BERT for Psychological Stress Detection

Let’s take a look at a real-world example now that we understand the fundamental ideas behind BERT. For this guide, the dataset has been collected from the website Tweet Sentiment to CSV, and it is based on the ‘Stress’ and ‘Relax’ moods reflected in the tweets.

The model has been trained using a pre-trained BERT model in the following manner:

The tokeniser for BERT has been created using the pre-trained BERT model, namely “bert-base-uncased”. In this context, BertTokenizer has been used.

Data Encoding

The data (train and test) has been encoded using the tokeniser, and the final train and test data have been generated. In this context, truncation=True and padding=True have been used as the encoding parameters.
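What truncation and padding do to a batch of encoded tweets can be sketched without any library. This is an illustration of the idea only, not the tokeniser's own code: `truncate` and `encode_batch` are hypothetical helpers, the ids 101 and 102 stand for [CLS] and [SEP] and 0 for [PAD] as in “bert-base-uncased”, and the remaining ids are arbitrary. Sequences longer than the limit are cut while keeping the final [SEP], shorter ones are padded to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0.

```python
def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping its final token (the [SEP] marker)."""
    if len(ids) <= max_length:
        return list(ids)
    return ids[:max_length - 1] + [ids[-1]]

def encode_batch(batch_ids, max_length=512, pad_id=0):
    """Truncate, then pad every sequence to the longest one in the batch."""
    truncated = [truncate(ids, max_length) for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 = real token the model should attend to, 0 = padding to be ignored.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Two already-encoded tweets of different length (middle ids are arbitrary).
batch = [[101, 11, 12, 13, 14, 102], [101, 21, 102]]
input_ids, attention_mask = encode_batch(batch, max_length=5)
print(input_ids)       # [[101, 11, 12, 13, 102], [101, 21, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```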

Model Training

The model has been trained using the generated train data and using the following parameters:

training data batch size = 16
validation data batch size = 16

Install necessary libraries and packages such as:

!pip install transformers
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification


Using pre-trained BERT model

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)


Implementation of attention masking



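# sent is a single tweet text; pad or truncate every input to 64 tokens and return the attention mask
bert_inp=bert_tokenizer.encode_plus(sent,add_special_tokens = True,max_length =64,padding = 'max_length',truncation = True,return_attention_mask = True)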





Model training and fitting:

# Save the best weights (lowest val_loss) and log training for TensorBoard
callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path,save_weights_only=True,monitor='val_loss',mode='min',save_best_only=True),tf.keras.callbacks.TensorBoard(log_dir=log_dir)]
print('\nBert Model')
bert_model.summary()

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-08)

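from sklearn.metrics import classification_report, f1_score

# Fine-tune bert_model. train_inp, train_mask and train_label are assumed names for the encoded
# training data; the number of epochs is not stated here, so the Keras default applies.
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])
bert_model.fit([train_inp,train_mask],train_label,batch_size=16,validation_data=([val_inp,val_mask],val_label),callbacks=callbacks)

# Load the best checkpoint into a fresh model and predict on the validation data
trained_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)
trained_model.load_weights(model_save_path)
preds = trained_model.predict([val_inp,val_mask],batch_size=32)
pred_labels = preds[0].argmax(axis=1)  # preds[0] holds the logits, one row per tweet
f1 = f1_score(val_label,pred_labels)
print('F1 score',round(f1*100,2),"%")

print('Classification Report')
print(classification_report(val_label,pred_labels))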


After classifying the tweets, the confusion matrix was visualised. Out of 334 observations, 325 were predicted correctly and 9 were misclassified.
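The reported counts translate directly into an accuracy figure, and a confusion matrix itself is simple to build from true and predicted labels. The sketch below uses a tiny, made-up set of labels for the matrix, because the article does not give the per-class breakdown; only the totals 325 and 334 come from the text above.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def accuracy(matrix):
    """Share of observations on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy labels for illustration only (0 = 'Relax', 1 = 'Stress' is an assumed encoding).
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
toy_matrix = confusion_matrix(toy_true, toy_pred)
print(toy_matrix, round(accuracy(toy_matrix), 3))

# Accuracy implied by the counts reported in the article.
article_accuracy = 325 / 334
print(f"{article_accuracy:.2%}")  # 97.31%
```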


Undoubtedly, BERT represents a milestone in machine learning’s application to natural language processing. Future practical applications are anticipated to be numerous given how easy it is to use and how quickly it can be fine-tuned. It’s not an exaggeration to say that BERT has significantly altered the NLP landscape.