What is data enrichment, and why is it important?
Data enrichment is the process of combining first-party data from internal sources with disparate data from other internal systems or third-party data from external sources.
Usually, the data available from clients or stakeholders is not enough to solve the given problem statement. For example, if a client asks us to build a recommendation engine for their mutual fund business, the data they typically have is old purchase data. That alone is not enough, because client behaviour changes with time and is affected by present market conditions, oil prices, etc., which need to be incorporated into the model to make it effective.
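As a minimal sketch of what enrichment looks like in practice (the file names and columns here are hypothetical, not from this project), we can join internal purchase records with external market data:

import pandas as pd

# First-party data: the client's historical purchases (hypothetical file and columns)
purchases = pd.read_csv('purchases.csv')        # client_id, fund_id, date
# Third-party data: external market indicators (hypothetical file and columns)
market = pd.read_csv('market_indicators.csv')   # date, oil_price, market_index
# Enrich each purchase record with the market conditions on its date
enriched = purchases.merge(market, on='date', how='left')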
The code for this tutorial is available at https://github.com/LoginRadius/engineering-blog-samples/tree/master/Data_Science/Full_DataScience_Pipeline_Implementation
I have implemented a full data science pipeline, from scraping data from the web to building ML and NLP classification models. The whole process is divided into four phases:
- Phase I:
Here I scraped data from the IMDB website (imdb.py)
- Phase II:
Here I implemented a simple ML regression on the data (ml_imdb.py)
- Phase III:
Here I prepared the data for NLP classification (multilabel_prep.py)
- Phase IV:
Here I implemented a multilabel NLP classifier using various techniques like classifier chains (multilabel_nlp_classifier.ipynb)
What is web scraping?
Web scraping is the process of extracting and parsing raw data from the web. It is an efficient data collection technique that helps data scientists enrich their data.
This world is full of data, but unfortunately, most of it is not in a usable form; data is like crude oil, in an unstructured form. For a data scientist or engineer, the first challenge is to make the data ready for model consumption, which takes the majority of the time. This whole process is collectively known as data preprocessing.
HTML is the primary markup language and the base framework of almost all websites, so a basic knowledge of it is necessary for performing web scraping.
Here we will start by requesting the web page using the Python package requests.
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
print(len(response.text))

The whole web page is now stored in the variable response. Then we parse the page using the beautifulsoup package.
from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

Then I store every div with the class lister-item mode-advanced in the variable movie_containers.
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

Then I iterate through this object and store the information in lists to build my final DataFrame, using simple for loops.
## Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
#gross = []  # many movies have no record
movie_description = []
movie_duration = []
movie_genre = []

## Extract data from each individual movie container
for container in movie_containers:
    ## If the movie has a Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        ## The name
        name = container.h3.a.text
        names.append(name)
        ## The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        ## The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        ## The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))
        ## The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))
        ## Gross income of the movie
        #gross_inc = container.find_all('span', attrs = {'name':'nv'})[1]['data-value']
        #gross.append(gross_inc)
        ## The movie description
        movie_desc = container.find_all('p', class_ = 'text-muted')[1].text
        movie_description.append(movie_desc)
        movie_det = container.find_all('p', class_ = 'text-muted')[0]
        ## The movie duration
        movie_dur = movie_det.find('span', class_ = 'runtime').text
        movie_duration.append(movie_dur)
        ## The movie genre
        movie_gen = movie_det.find('span', class_ = 'genre').text
        movie_genre.append(movie_gen)

import pandas as pd
one_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes,
                       #'gross': gross,
                       'movie decription': movie_description,
                       'movie duration': movie_duration,
                       'movie genre': movie_genre
                       })
print(one_df.info())
one_df.to_csv('50_movie_details.csv')

But this was only for one page, which has data for just 50 movies, and that is not enough to build a model.
Please refer to my code to understand how I use simple for loops to iterate through all the pages and download data for approximately 20 years of movies.
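As a rough sketch (the loop in imdb.py differs in details), pagination can be handled by formatting the year and page number into the search URL:

from requests import get
from bs4 import BeautifulSoup
from time import sleep

all_containers = []
for year in range(1998, 2018):   # roughly 20 years of releases
    for page in range(1, 6):     # the first few result pages per year
        url = ('http://www.imdb.com/search/title?release_date={}'
               '&sort=num_votes,desc&page={}'.format(year, page))
        response = get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        all_containers += html_soup.find_all('div', class_ = 'lister-item mode-advanced')
        sleep(1)                 # pause between requests to be polite to the server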
Implementing a simple linear regression on the numerical data we just scraped
What is linear regression?
It is one of the most popular and widely used statistical techniques.
• Used to understand the relationship between variables
• Can also be used to predict a value of interest for new observations
• The aim is to predict the value of a continuous numeric variable of interest (known as the response, dependent, or target variable)
• The values of one or more predictor (or independent) variables are used to make the prediction
• One predictor = simple regression
• More predictors = multiple regression
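In equation form, simple regression fits y = b0 + b1*x, while multiple regression fits y = b0 + b1*x1 + b2*x2 + ... + bn*xn; the coefficients b are estimated by minimizing the squared error between predicted and observed values.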
Here I first used the metascore of the movies to predict the IMDB rating, and then tried to improve the model by using both metascore and votes to predict the IMDB rating.
### ML model
X = data.loc[:, 'metascore'].values
y = data.loc[:, 'imdb'].values

## Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()  # make an object of the regression class
regressor.fit(X_train.reshape(-1,1), y_train.reshape(-1,1))  # fit the regressor to our training data

# predict the test results
y_pred = regressor.predict(X_test.reshape(-1,1))

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
## 0.18041462828221905

### Let's try with metascore and votes
X1 = data.loc[:, ['metascore','votes']].values
y1 = data.loc[:, 'imdb'].values

## Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size = 0.33, random_state = 0)

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()  # make an object of the regression class
regressor.fit(X_train, y_train)  # fit the regressor to our training data

# predict the test results
y_pred = regressor.predict(X_test)

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
## 0.15729132122310804 (a good score)

I scraped data from the IMDB site and then applied ML regression techniques to it. Later, I found that the movies listed are multilabel; Logan, for example, belongs to Action, Drama, and Sci-Fi. This led me to think about how to implement a classifier model on multilabel data. The data we get in the real world is mostly multilabelled, like chatbot data, where there are many intents, or these movies, which belong to multiple genres.
Here we will first see how we prep our data for multilabel classification.
Here we have all the tags in one single column, which is not usable for classification, so we have to make a separate column for each label; if a row doesn't belong to that category, it is filled with 0, else 1.
import os
os.chdir('Desktop/web_scraping/imdb scrapper_ml/')
import pandas as pd

data = pd.read_csv('multilabel_nlp_classification.csv')

## Collect every genre tag that appears in the data
movie_list = [x for x in data['movie genre']]
movie_list1 = ''
for x in data['movie genre']:
    movie_list1 += ',' + x
li_m = movie_list1.split(',')
li = [x.strip() for x in li_m if x.strip()]  # skip the empty string left by the leading comma
list_s = list(set(li))

## Create a 0-filled column for each unique genre
for x in list_s:
    data[x] = 0

data['movie_genre'] = [x.strip().split(',') for x in data['movie genre']]
de = data.copy()
#data.loc[0,'Action'] = 1
de['id'] = range(0, 6116)
#print(de.loc[de['id']==0,'Action'])

## Set a movie's genre columns to 1 for each genre it belongs to
for i in range(0, 6116):
    for x in de.loc[de['id']==i, 'movie_genre']:
        for y in x:
            y = y.strip()
            de.loc[de['id']==i, y] = 1

de.to_csv('multilabel_nlp_classification.csv')

Now, as our data is ready, we can start with the NLP implementation.
For multilabel classification, I used techniques like classifier chain, label powerset, etc.
Here the problem statement is: using the movie description, our model has to guess which genres the movie belongs to. This is a popular use case. Take the example of ecommerce product description data: instead of manually assigning labels, we can use a model that finds the relevant labels or genres and makes the content relevant to the type it belongs to.
I start with exploratory data analysis and then data cleaning, which is the most crucial step: if all the descriptions share some 30-50 very common words, they simply make the data heavy and the model slow and inefficient.
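As an illustrative sketch of such cleaning (not the exact code from the notebook; it assumes the NLTK stop word list is available):

import re
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                    # lowercase everything
    text = re.sub(r'[^a-z\s]', ' ', text)  # strip punctuation and digits
    return ' '.join(w for w in text.split() if w not in stop_words)  # drop common words

data['movie decription'] = data['movie decription'].apply(clean_text)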
Then we make the data model-ready: ML models don't understand text data, so we have to feed them numbers. For that purpose, we use TfidfVectorizer.
What is TfidfVectorizer?
TfidfVectorizer transforms text into TF-IDF feature vectors that can be used as input to the estimator.
Then we simply divide the data into train and test splits.
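For context, the vectorizer and the split used below could be set up roughly like this (a sketch; it assumes data is the prepared DataFrame, and the TfidfVectorizer parameters are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the rows, then keep the description column of each split as the text input
train, test = train_test_split(data, test_size = 0.3, random_state = 42)
train_text = train['movie decription']
test_text = test['movie decription']

vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 5000)
vectorizer.fit(train_text)   # learn the vocabulary and IDF weights on the training text only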
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','movie decription'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','movie decription'], axis=1)

I first tried applying logistic regression with a one-vs-rest classifier.
What is OneVsRestClassifier?
The OneVsRestClassifier strategy splits a multi-class or multilabel classification problem into one binary classification problem per class: it fits one classifier per class, and for each classifier, that class is fitted against all the other classes.
## Using a pipeline to apply logistic regression with a one-vs-rest classifier
LogReg_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
])

for category in categories:
    printmd('Processing {} ...'.format(category))  # printmd is a markdown-print helper defined in the notebook

    ## Training the logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])

    ## Calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")

Next, I tried BinaryRelevance.
What is BinaryRelevance?
It is a simple technique that treats each label as a separate binary classification problem.
## Using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

## Initialize the binary relevance multi-label classifier
## with a Gaussian naive Bayes base classifier
classifier = BinaryRelevance(GaussianNB())

## Train
classifier.fit(x_train, y_train)

## Predict
predictions = classifier.predict(x_test)

Next, I tried using ClassifierChain.
What is ClassifierChain?
It is similar to BinaryRelevance, except that the first classifier is trained just on the input data, and then each subsequent classifier is trained on the input space plus the outputs of all the previous classifiers in the chain.
from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression

## Initialize the classifier chains multi-label classifier
classifier = ClassifierChain(LogisticRegression())

## Training the logistic regression model on train data
classifier.fit(x_train, y_train)

## Predict
predictions = classifier.predict(x_test)

Next, I tried using Label Powerset.
What is LabelPowerset?
Here we transform the problem into a multi-class problem: one multi-class classifier is trained on all the unique label combinations found in the training data.
from skmultilearn.problem_transform import LabelPowerset

## Initialize the label powerset multi-label classifier
classifier = LabelPowerset(LogisticRegression())

## Train
classifier.fit(x_train, y_train)

## Predict
predictions = classifier.predict(x_test)

Please refer to my notebook multilabel_nlp_classifier.ipynb in my repo for more details.
Improvements:
- More feature engineering and more data to avoid overfitting and make the pipeline more efficient.
- If we collect more data, deep learning and state-of-the-art algorithms like BERT can help us improve the efficiency of the model.
Summary:
- We have learnt how to collect data by web scraping and the tools to perform it.
- We applied modelling techniques to the numerical data.
- We prepared the labelled data to be model-ready.
- We learnt how different ML techniques can be applied to text data to build a multilabel classifier.

