Data Engineering For Beginners – Partitioning Vs Bucketing In Apache Hive

Overview
Understand the meaning of partitioning and bucketing in Hive in detail.
We will see how to create partitions and buckets in Hive.

Introduction
You might have seen an encyclopedia in your school or college library. It is a set of books that will give you information about almost anything. Do you know what is the best thing about the encyclopedia?
Yes, you guessed it correctly. The words are arranged alphabetically. For example, if you have the word “Pyramids” in mind, you go directly to the volume titled “P”; you don’t have to search the other books. Can you imagine how tough the task would be to search for a single word if the books were stored without any order?
Here storing the words alphabetically represents indexing, but using a different location for the words that start from the same character is known as bucketing.
Similar kinds of storage techniques, like partitioning and bucketing, exist in Apache Hive so that we can get faster results for search queries. In this article, we will see what partitioning and bucketing are, and when to use which one.

Table of Contents
What is Partitioning?
When to use Partitioning?
What is Bucketing?
When to use Bucketing?

What is Partitioning?
Apache Hive allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally. Let’s understand it with an example:
Suppose we have to create a table in the hive which contains the product details for a fashion e-commerce company. It has the following columns:
Now, the first filter that most customers use is Gender; then they select categories like Shirt, its size, and color. Let’s see how to create the partitions for this example:

CREATE TABLE products (
    product_id string,
    brand string,
    size string,
    discount float,
    price float
)
PARTITIONED BY (gender string, category string, color string);
Hive will now store the data in a directory structure like:

/user/hive/warehouse/mytable/gender=male/category=shoes/color=black
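As a rough illustration, the mapping from a row’s partition-column values to this directory layout can be sketched in plain Python (the partition_path helper below is a hypothetical sketch, not part of Hive itself):

```python
def partition_path(base, partition_cols, row):
    """Build the warehouse directory for a row from its partition-column
    values, mirroring Hive's key=value directory naming."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base] + parts)

row = {"product_id": "p1", "gender": "male", "category": "shoes", "color": "black"}
print(partition_path("/user/hive/warehouse/mytable",
                     ["gender", "category", "color"], row))
# /user/hive/warehouse/mytable/gender=male/category=shoes/color=black
```

Every distinct combination of partition-column values produces one such directory, which is why the number of combinations must stay manageable.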
Partitioning the data gives us performance benefits and also helps us organize the data. Now, let’s see when to use partitioning in Hive.

When to use Partitioning?
Use partitioning when a frequently queried column has low cardinality. For example, if you partition by country name, at most about 195 partitions will be created, and that number of directories is easily manageable by Hive.
On the other hand, do not create partitions on columns with very high cardinality, such as product ID, timestamp, or price, because doing so would create millions of directories, which would be impossible for Hive to manage.
Partitioning is effective when the data volume in each partition is not very high. For example, suppose you have airline data and want to calculate the total number of flights in a day. The result will take longer to compute over the partition for the “United Arab Emirates”, since Dubai has one of the busiest airports in the world, whereas a country like “Albania” will return results much quicker.

What is Bucketing?
In the above example, we know that we cannot create a partition on the price column because its data type is float and an infinite number of unique prices is possible.
Hive would have to generate a separate directory for each unique price, and it would be very difficult for Hive to manage these. Instead, we can manually define the number of buckets we want for such columns.
In bucketing, partitions can be subdivided into buckets based on a hash function of a column. This gives extra structure to the data, which can be used for more efficient queries:

CREATE TABLE products (
    product_id string,
    brand string,
    size string,
    discount float,
    price float
)
PARTITIONED BY (gender string, category string, color string)
CLUSTERED BY (price) INTO 50 BUCKETS;
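Hive decides which bucket a row lands in by hashing the bucketing column and taking the result modulo the bucket count. A minimal Python sketch of that assignment, with Python’s built-in hash() standing in for Hive’s own hash function:

```python
def bucket_for(value, num_buckets=50):
    # Hash the bucketing column value and fold it into one of the
    # num_buckets buckets, mirroring CLUSTERED BY ... INTO 50 BUCKETS.
    return hash(value) % num_buckets

prices = [199.0, 499.5, 199.0, 1250.0]
print([bucket_for(p) for p in prices])

# Rows with equal prices always land in the same bucket, which is what
# makes bucketed joins and sampling efficient.
assert bucket_for(199.0) == bucket_for(199.0)
```

The exact bucket numbers depend on the hash function; what matters is that the bucket count is fixed up front and equal values always map to the same bucket.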
Now, only 50 buckets will be created no matter how many unique values the price column contains. Note that Hive assigns rows to buckets by hashing the bucketing column, not by value ranges: each price is hashed, and the hash modulo 50 determines the bucket, so all rows with the same price always land in the same bucket.

When to use Bucketing?
We cannot partition on a column with very high cardinality. Too many partitions result in too many Hadoop files and directories, which increases the load on the node that carries the metadata of each partition (in HDFS, the NameNode).
If some map-side joins are involved in your queries, then bucketed tables are a good option. A map-side join is a process where two tables are joined using the map phase only, without any reduce phase. I would recommend you go through this article for more understanding of map-side joins: Map Side Joins in Hive.

End Notes
In this article, we have seen what partitioning and bucketing are, how to create them, and their pros and cons.
I would highly recommend you go through the following resources to learn more about Apache Hive:
Analytics Insight features top data science training courses for beginners in 2023
Data science is thriving in the global tech-driven market owing to its unprecedented potential to help organizations make smarter decisions and yield higher revenue efficiently. Aspiring data scientists aim to join popular, reputed global companies to build a successful professional career in data management. But to add value to their CVs in this competitive world, aspiring data scientists need a strong understanding of the concepts and mechanisms of data science. It can be overwhelming to select any one data science course that covers data management and data visualization, so let’s explore some of the top data science training courses for beginners to learn in 2023.

Top data science training courses for aspiring data scientists

Applied Data Science with Python from Michigan University at Coursera
Applied Data Science with Python from Michigan University at Coursera is one of the top data science training courses for aspiring data scientists, who can enrol for free today and learn to apply data science methods and techniques. Beginners can conduct inferential statistical analysis, data visualization, data analysis, and more. The specialization comprises five courses teaching data science through Python, on a flexible schedule of approximately five months, and awards a certificate. It includes hands-on projects for a strong practical understanding of the subject.

Introduction to Data Science using Python at Udemy
Introduction to Data Science using Python at Udemy teaches aspiring data scientists the basics of data science and analytics, Python, and scikit-learn, with online video content, a certificate, and a direct-message line to the instructor. Udemy is well known for offering highly relevant data science training courses on data visualization and effective data management.

Analyze Data with Python at Codecademy
Analyze Data with Python at Codecademy covers the fundamentals of data analysis while efficiently and effectively building Python skills. Aspiring data scientists can learn Python, NumPy, SciPy, and more to gain Python, data management, and data visualization skills, and earn a certificate after completion. There are multiple practical projects for building a strong understanding of data science, such as FetchMaker and A/B Testing, and eight courses with step-by-step guidance that deliver specialized skills within a few months.

Data Science Specialization from Johns Hopkins University at Coursera
Data Science Specialization from Johns Hopkins University at Coursera offers a ten-course introduction to data science from eminent teachers; enrolment is free from today. Learners gain knowledge of using R for data management and data visualization, navigating the data science pipeline for data acquisition, and more. The course runs on a flexible schedule of approximately 11 months at seven hours per week, and includes hands-on projects that lead to a certificate worth adding to a CV.

Programming for Data Science with Python at Udacity
Programming for Data Science with Python at Udacity is a well-known data science training course for beginners. It helps prepare for a data science career with programming tools such as Python, SQL, and Git. The estimated time to complete the course is three months at ten hours per week. Aspiring data scientists should enrol by November 3, 2023, to learn to solve problems with effective data management and data visualization. The course features real-world projects from industry experts, technical mentor support, and a flexible learning program.

Data Science for Everyone at DataCamp
Data Science for Everyone at DataCamp is one of the top data science training courses for beginners. It provides an introduction to data science without any involvement in coding. It includes 48 exercises with 15 videos for aspiring data scientists. They can learn about different data scientist roles, foundational topics, and many more. The course curriculum includes the introduction to data science, data collection and storage, data visualization, data preparation, and finally the experimentation and prediction.
NumPy arrays and Images
NumPy arrays find wide use in storing and manipulating image data. But what is image data really?
Images are made up of pixels that are stored in the form of an array. Each pixel has a value ranging from 0 to 255, with 0 indicating a black pixel and 255 indicating a white pixel. A colored image consists of three 2-D arrays, one for each of the color channels (Red, Green, and Blue), placed back-to-back, thus making a 3-D array. Each value in the array constitutes a pixel value, so the size of the array depends on the number of pixels along each dimension.
Have a look at the image below:
Python can read the image as an array, for example with Matplotlib’s imread() method (older tutorials use scipy.misc.imread(), but that function has been removed from recent SciPy releases). And when we output it, it is simply a 3-D array containing the pixel values:

import numpy as np
import matplotlib.pyplot as plt

# read image
im = plt.imread('./original.jpg')

# image
im
array([[[115, 106,  67],
        [113, 104,  65],
        [112, 103,  64],
        ...,
        [160, 138,  37],
        [160, 138,  37],
        [160, 138,  37]],

       [[117, 108,  69],
        [115, 106,  67],
        [114, 105,  66],
        ...,
        [157, 135,  36],
        [157, 135,  34],
        [158, 136,  37]],

       [[120, 110,  74],
        [118, 108,  72],
        [117, 107,  71],
        ...,
We can check the shape and type of this NumPy array:

print(im.shape)
print(type(im))

(561, 997, 3)
<class 'numpy.ndarray'>
Now, since an image is just an array, we can easily manipulate it using the array functions we have looked at in this article. For example, we can flip the image horizontally using the np.flip() method:

# flip the image left-right (axis=1 reverses the column order)
plt.imshow(np.flip(im, axis=1))
Or you could normalize or change the range of values of the pixels. This is sometimes useful for faster computations.

im / 255
array([[[0.45098039, 0.41568627, 0.2627451 ],
        [0.44313725, 0.40784314, 0.25490196],
        [0.43921569, 0.40392157, 0.25098039],
        ...,
        [0.62745098, 0.54117647, 0.14509804],
        [0.62745098, 0.54117647, 0.14509804],
        [0.62745098, 0.54117647, 0.14509804]],

       [[0.45882353, 0.42352941, 0.27058824],
        [0.45098039, 0.41568627, 0.2627451 ],
        [0.44705882, 0.41176471, 0.25882353],
        ...,
        [0.61568627, 0.52941176, 0.14117647],
        [0.61568627, 0.52941176, 0.13333333],
        [0.61960784, 0.53333333, 0.14509804]],

       [[0.47058824, 0.43137255, 0.29019608],
        [0.4627451 , 0.42352941, 0.28235294],
        [0.45882353, 0.41960784, 0.27843137],
        ...,
        [0.6       , 0.52156863, 0.14117647],
        [0.6       , 0.52156863, 0.13333333],
        [0.6       , 0.52156863, 0.14117647]],

       ...,
Remember this is using the same concept of ufuncs and broadcasting that we saw in the article!
There are a lot more things you could do to manipulate your images, which would be useful when you are classifying images using neural networks. If you are interested in building your own image classifier, you could head here for an amazing tutorial on the topic!

End Notes
This article was published as a part of the Data Science Blogathon.

Introduction
Dear data engineers, this article covers a very interesting topic. Let me give some flashback: a few years ago, Mr. Someone in a discussion coined the question of how the ACID and BASE properties apply to data. A sudden silence dropped over the room. Everyone started staring at each other’s faces; a few of them started saying that H2SO4, HCl, HNO3, and H2CO3 are acids, while KOH and NaOH are bases.
The person who threw out the word stood up and said: Guys, kindly listen to me. I know you are all A+ in your engineering chemistry, chemical engineering, or whatever chemistry you learned during school and college. But I am talking about data engineering. What I mentioned are the key properties of a transaction, specifically from an operational perspective. Yes, this is essential for OLTP and OLAP in the current digital transformation, and it applies across all industries when implementing the best operational systems and building modern data warehouses. He then started articulating all the ingredients in detail, as follows.

What is a Modern Database (DB)?
We know that databases are well-structured, organized collections of data stored on DB servers. The main focus is to store, manage, and handle that data, and to process it for analytics in order to derive the necessary insights, build various business solutions, and use them to enhance business opportunities. These so-called modern database systems are typically managed on the cloud, and they have been designed to run precisely in multiple cloud environments like Azure, AWS, and GCP.

Why are ACID and BASE Important in This Modern Database World?
In this context, ACID and BASE are both guiding stars, leading organizations to a successful database management approach.
All good, but what is the problem with the existing DB management approach, and where do ACID and BASE come into the picture? There are several reasons. In the current data world, one of the major challenges is that we generate massive amounts of data that must be processed within seconds, minutes, hours, or on a daily basis; I hope you all agree with me. So we started calling this data BIG DATA. What is its scope? I certainly can’t say it in one word or one line, because there is much more to it.

What Are the Benefits of ACID and BASE?
To get the most benefit, we first have to enhance the capabilities and standards of the data during each action on it, whether inserting, updating, selecting, analyzing, or implementing data products with golden datasets. The best technique in the data and data warehouse domain for steering through the convolutions of data management is the disciplined use of the various database sources.
To achieve this, ACID and BASE are sets of guiding standards used to guarantee that database transactions are processed consistently.
My quick summary of these standards: whenever changes are made within a database, they need to be performed carefully, ensuring the data within doesn’t become tainted. Applying the ACID properties to each transaction, that is, each modification of rows in a table or database, is the best way to maintain the truth and consistency of a database.
Simplified operational functions
Reliable and durable storage.

What is ACID?
ACID refers to four major properties: Atomicity, Consistency, Isolation, and Durability.
If your database operations have all these ACID properties, we can call them an ACID transaction, and a data store that applies this set of operations is called an ACID transaction system.
This guarantees data integrity regardless of system failures, power failures, errors, or other issues with the data and its transaction activities, such as creating new record(s) or updating existing row(s).
In simple terms, ACID provides guiding principles that ensure database transactions are processed consistently.
Let’s focus on each property in detail now.
Atomicity: in just one word, I could say “completed” or “not at all” with respect to my transaction; further simplified, “done” or “don’t disturb”. Still confused? Yes, I can understand. During database transaction(s), we have to ensure that the commit statement finishes the entire operation successfully. In other cases, such as DB connection issues, internet outages, power outages, missing data constraints, or data-quality interruptions in the middle of an operation, the database should roll back to its prior safe state and hold the right data as it was before the commit was initiated.
By using atomicity, we ensure that either the entire transaction completes or none of it does.
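A minimal sketch of atomicity using Python’s built-in sqlite3 module (SQLite is an ACID-compliant store; the accounts table and the simulated failure are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Try to move 40 units from alice to bob, but crash between the two updates.
try:
    with conn:  # commits on success, rolls back automatically on an exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        raise RuntimeError("simulated power outage")  # bob is never credited
except RuntimeError:
    pass

# Atomicity: the half-finished debit was rolled back, nothing was lost.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Without the rollback, alice would have been debited 40 units that bob never received, which is exactly the torn state atomicity prevents.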
Consistency: as with anything, the expectation from a database is consistency; it means maintaining data integrity constraints throughout the transaction’s journey to sustain quality and performance. A transaction that would violate these constraints is abandoned, and its changes are rolled back to the previous state to retain consistency.
Isolation: each transaction is performed as if serialized, in a distinct order, without impacting any other transactions happening in parallel. In other words, multiple transactions can occur in parallel, yet no transaction can affect another occurring at the same time. This can be accomplished anywhere between two ends of a spectrum: optimistic and pessimistic transaction scopes.
• An optimistic transaction assumes conflicts are rare: it proceeds without locking and is terminated (and retried) if a conflicting read or write to the same place is detected.
• A pessimistic transaction locks the resources it touches up front, so conflicting transactions wait rather than abort.
Durability: as we know, durability ensures stability and sustainability. In the same fashion, even after a system failure, changes that were successfully committed to the database(s) will survive permanently, making sure that the data is not corrupted at any cost.

How are ACID Transactions Implemented?
Identify the location of the record that needs to be updated on the table/DB server.
Get buffer memory ready for transferring the disk block into the memory space.
Apply your updates in that memory space.
Push the modified block back out to disk.
Lock the transaction(s) until each transaction completes or fails.
Make sure the transactions are stored in the transaction log table/files.
Because the data is first saved in this separate repository and only then applied to the actual database, the transaction can be called ACID.
In case of a system failure mid-transaction, the transaction either rolls back or continues from where the transaction log left off.
All done in the best way! The ACID is in place.
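The steps above amount to a write-ahead log. A toy sketch of the idea in Python (the file name, record format, and helper names are invented for illustration; real databases use far more careful formats, locking, and checkpointing):

```python
import json
import os

LOG = "txn.log"  # illustrative log file name

def apply_change(db, key, value):
    """Append the change to the log (and force it to disk) BEFORE
    touching the database itself."""
    with open(LOG, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # durability: the record survives a crash
    db[key] = value           # only now modify the actual database

def recover():
    """After a crash, replay the log to rebuild every committed change."""
    db = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                rec = json.loads(line)
                db[rec["key"]] = rec["value"]
    return db

db = {}
apply_change(db, "row42", "new value")
restarted_db = recover()      # pretend the process died and restarted
print(restarted_db)           # {'row42': 'new value'}
os.remove(LOG)
```

Because the log entry reaches disk before the database changes, a crash at any point leaves either a replayable record or no record at all, never a half-applied change.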
He then highlighted how, in chemistry, a BASE is the opposite of an ACID; in database concepts we have a similar relationship. The BASE model offers a number of trade-offs relative to ACID: it is more focused on the availability of database systems, and it relates to ACID only indirectly.
We can derive the words behind B-A-S-E:
• Basically Available – Availability is the key factor in the current digital world. In the BASE context, databases guarantee the availability of the required data by replicating it across different geographies, rather than enforcing immediate consistency on the database. In cloud (e.g., Azure) technology, this is a mandatory action item when implementing any data component, and it comes with a simple and powerful configuration process.
• Soft State – The state of the system may change over time, even without new input; business rules are not strictly checked, and data is not guaranteed to be written consistently at every moment.
• Eventually Consistent – In the BASE context, consistency is not enforced immediately; this simplifies the database, with the guarantee that reads will eventually return the last refreshed data.
In this modern database engineering culture, there are many more options for BASE-leaning databases than ACID-specific ones. NoSQL databases are the main examples; these tend to be inclined toward BASE principles. My favorites are MongoDB, Cosmos DB, and Cassandra, though some NoSQL databases also partially apply ACID rules where required for functional facets. These can be useful for the staging layer of data warehouses and data lakes.
Mr. Someone completed his big journey through ACID and BASE. Finally, the folks in the meeting room asked whether we have pH values in the database and any specific factors to improve and neutralize them. He replied: Yes! We will discuss this in the next meeting. And the meeting was closed.

Conclusion
Guys! I hope you understood, and I believe the takeaways from this article are the following:
What a modern database (DB) is and its features
What ACID and BASE are, and why both are important in this modern database world
The advantages of implementing ACID in a database
A very detailed study of ACID, and how to implement it with simple steps
How BASE is more flexible than ACID, and the BASE-oriented databases available in the market
ACID transactions’ pitfalls
Since they use a locking mechanism, ACID transactions tend to be sluggish for read and write operations, so high-volume applications might take a performance hit.
So the choice is yours, based on whether you need strong consistency or availability: slower, ACID-compliant DBs, or faster, non-ACID-compliant ones.
Remember, data consistency, data quality, and availability are the aspects of major interest for decision-making and prediction.
Thanks a lot for your time, and I will get back with another interesting topic shortly! Till then, bye! – Shantha
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion
One of the most common questions we get on Analytics Vidhya is,
How much maths do I need to learn to be a data scientist?
Even though the question sounds simple, there is no simple answer to it. Usually, we say that you need to know basic descriptive and inferential statistics to start, and that is a good place to begin.
But, once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work, what their limitations are, and whether they make any underlying assumptions. Now, there could be a lot of areas to study, including algebra, calculus, statistics, 3-D geometry, etc.
If you get confused (like I did) and ask experts what should you learn at this stage, most of them would suggest / agree that you go ahead with Linear Algebra.
But, the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra. You can get lost in the detailed mathematics and derivation and learning them would not help as much! I went through that journey myself and hence decided to write this comprehensive guide.
If you have faced this question about how to learn & what to learn in Linear Algebra – you are at the right place. Just follow this guide.
And if you’re looking to understand where linear algebra fits into the overall data science scheme, here’s the perfect article:

Table of contents
Motivation – Why learn Linear Algebra?
3.3 Representing in Matrix form
4.2.3 Use of Inverse in Data Science
5.2 Use of Eigenvectors in Data Science: PCA algorithm
Singular Value Decomposition of a Matrix
End Notes

1. Motivation – Why learn Linear Algebra?
I would like to present 4 scenarios to showcase why learning Linear Algebra is important if you are learning Data Science and Machine Learning.

Scenario 1:
What do you see when you look at the image above? You most likely said flower and leaves – not too difficult. But if I ask you to write that logic so that a computer can do the same for you, it will be a very difficult task (to say the least).
You were able to identify the flower because the human brain has gone through millions of years of evolution. We do not understand what goes on in the background to be able to tell whether the colour in the picture is red or black. We have somehow trained our brains to perform this task automatically.
But making a computer do the same is not easy, and it is an active area of research in Machine Learning and Computer Science in general. Before we work on identifying attributes in an image, though, let us ponder a particular question: how does a machine store this image?
You probably know that today’s computers are designed to process only 0s and 1s. So how can an image such as the one above, with multiple attributes like colour, be stored in a computer? This is achieved by storing the pixel intensities in a construct called a matrix. This matrix can then be processed to identify colours etc.
So any operation which you want to perform on this image would likely use Linear Algebra and matrices at the back end.

Scenario 2:
If you are somewhat familiar with the Data Science domain, you might have heard the word “XGBoost” – an algorithm employed most frequently by winners of Data Science competitions. It stores numeric data in the form of a matrix to give predictions, which enables XGBoost to process data faster and provide more accurate results. Moreover, not just XGBoost but various other algorithms use matrices to store and process data.

Scenario 3:
Deep Learning, the new buzzword in town, employs matrices to store inputs such as images, speech, or text to give state-of-the-art solutions to these problems. Weights learned by a neural network are also stored in matrices. Below is a graphical representation of weights stored in a matrix.

Scenario 4:
Another active area of research in Machine Learning is dealing with text, and the most common techniques employed are Bag of Words, Term Document Matrix, etc. All these techniques store counts (or something similar) of words in documents in matrix form, and use this frequency count to perform tasks like semantic analysis, language translation, language generation, etc.
So now you can see the importance of Linear Algebra in machine learning: images, text, or any data in general employ matrices to store and process data. This should be motivation enough to go through the material below to get you started on Linear Algebra. This is a relatively long guide, but it builds Linear Algebra from the ground up.

2. Representation of problems in Linear Algebra
Let’s start with a simple problem. Suppose the price of 2 bats and 1 ball is 100 units, and the price of 1 bat and 2 balls is also 100 units. We need to find the price of a ball and a bat.
Suppose the price of a bat is Rs ‘x’ and the price of a ball is Rs ‘y’. Values of ‘x’ and ‘y’ can be anything depending on the situation i.e. ‘x’ and ‘y’ are variables.
Let’s translate this in mathematical form –
2x + y = 100 ...........(1)
Similarly, for the second condition-
x + 2y = 100 ..............(2)
Now, to find the prices of bat and ball, we need the values of ‘x’ and ‘y’ such that it satisfies both the equations. The basic problem of linear algebra is to find these values of ‘x’ and ‘y’ i.e. the solution of a set of linear equations.
Broadly speaking, in linear algebra data is represented in the form of linear equations. These linear equations are in turn represented in the form of matrices and vectors.
The number of variables as well as the number of equations may vary depending upon the condition, but the representation is in form of matrices and vectors.2.1 Visualise the problem
It is usually helpful to visualize data problems. Let us see if that helps in this case.
Linear equations represent flat objects. We will start with the simplest one to understand i.e. line. A line corresponding to an equation is the set of all the points which satisfy the given equation. For example,
Points (50,0) , (0,100), (100/3,100/3) and (30,40) satisfy our equation (1) . So these points should lie on the line corresponding to our equation (1). Similarly, (0,50),(100,0),(100/3,100/3) are some of the points that satisfy equation (2).
Now in this situation, we want both of the conditions to be satisfied i.e. the point which lies on both the lines. Intuitively, we want to find the intersection point of both the lines as shown in the figure below.
Let’s solve the problem by elementary algebraic operations like addition, subtraction and substitution.
2x + y = 100 .............(1)
x + 2y = 100 ..........(2)
From equation (1):

y = 100 - 2x

Put this value of y into equation (2):

x + 2*(100 - 2x) = 100 ..........(3)
Now, since the equation (3) is an equation in single variable x, it can be solved for x and subsequently y.
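Instead of substituting by hand, the same pair of equations can be handed to NumPy as a matrix problem:

```python
import numpy as np

# 2x + y = 100  and  x + 2y = 100, written as A @ [x, y] = b
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([100.0, 100.0])

solution = np.linalg.solve(A, b)
print(solution)  # [33.33333333 33.33333333], i.e. x = y = 100/3
```

This is exactly the matrix view of linear equations that the rest of this guide builds toward: the coefficients form a matrix, the unknowns a vector.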
That looks simple – let’s go one step further and explore.

2.2 Let’s complicate the problem
Now, suppose you are given a set of three conditions with three variables each as given below and asked to find the values of all the variables. Let’s solve the problem and see what happens.
From equation (4) we get,
Substituting value of z in equation (6), we get –
Now, we can solve equations (8) and (5) as a case of two variables, just as in the bat-and-ball problem above, to find the values of ‘x’ and ‘y’. Once we know ‘x’ and ‘y’, we can use (7) to find the value of ‘z’.
As you might see, adding an extra variable has tremendously increased our efforts for finding the solution of the problem. Now imagine having 10 variables and 10 equations. Solving 10 equations simultaneously can prove to be tedious and time consuming. Now dive into data science. We have millions of data points. How do you solve those problems?
We have millions of data points in a real data set. It is going to be a nightmare to reach to solutions using the approach mentioned above. And imagine if we have to do it again and again and again. It’s going to take ages before we can solve this problem. And now if I tell you that it’s just one part of the battle, what would you think? So, what should we do? Should we quit and let it go? Definitely NO. Then?
A matrix is used to solve a large set of linear equations. But before we go further and take a look at matrices, let’s visualise the physical meaning of our problem. Give a little thought to the next topic; it directly relates to the usage of matrices.

2.3 Planes
A linear equation in 3 variables represents the set of all points whose coordinates satisfy the equations. Can you figure out the physical object represented by such an equation? Try to think of 2 variables at a time in any equation and then add the third one. You should figure out that it represents a three-dimensional analogue of line.
Basically, a linear equation in three variables represents a plane. More technically, a plane is a flat geometric object which extends up to infinity.
As in the case of a line, finding solutions to 3 variables linear equation means we want to find the intersection of those planes. Now can you imagine, in how many ways a set of three planes can intersect? Let me help you out. There are 4 possible cases –
No intersection at all.
Planes intersect in a line.
They can intersect in a plane.
All the three planes intersect at a point.
Can you imagine the number of solutions in each case? Try doing this. Here is an aid picked from Wikipedia to help you visualise.
So, what was the point of having you visualise all the graphs above?
Normal humans like us and most of the super-mathematicians can only visualise things in 3 dimensions, and having to visualise things in 4 (or 10,000) dimensions is difficult, if not impossible, for mortals. So, how do mathematicians deal with higher-dimensional data so efficiently? They have tricks up their sleeves, and matrices are one such trick employed by mathematicians to deal with higher-dimensional data.
Now let’s proceed with our main focus i.e. Matrix.
3. Matrix
A matrix is a way of writing similar things together so that we can handle and manipulate them easily as per our requirements. In Data Science, matrices are generally used to store information like weights in an Artificial Neural Network while training various algorithms. You will be able to understand my point by the end of this article.
Technically, a matrix is a 2-D array of numbers (as far as Data Science is concerned). For example look at the matrix A below.
1 2 3
4 5 6
7 8 9
Generally, rows are denoted by ‘i’ and columns are denoted by ‘j’. An element is indexed by its ‘i’th row and ‘j’th column. We denote the matrix by some letter, e.g. A, and its elements by A(ij).
In above matrix
A12 = 2
To reach the result, go along the first row until you reach the second column.
3.1 Terms related to Matrix
Order of matrix – If a matrix has 3 rows and 4 columns, order of the matrix is 3*4 i.e. row*column.
Square matrix – The matrix in which the number of rows is equal to the number of columns.
Diagonal matrix – A matrix with all the non-diagonal elements equal to 0 is called a diagonal matrix.
Upper triangular matrix – Square matrix with all the elements below diagonal equal to 0.
Lower triangular matrix – Square matrix with all the elements above the diagonal equal to 0.
Scalar matrix – Square matrix with all the diagonal elements equal to some constant k and all the non-diagonal elements equal to 0.
Identity matrix – Square matrix with all the diagonal elements equal to 1 and all the non-diagonal elements equal to 0.
Column matrix – The matrix which consists of only 1 column. Sometimes, it is used to represent a vector.
Row matrix – A matrix which consists of only 1 row.
Trace – It is the sum of all the diagonal elements of a square matrix.
3.2 Basic operations on matrix
Let’s play with matrices and realise the capabilities of matrix operations.
Addition – Addition of matrices is almost similar to basic arithmetic addition. All you need is that the orders of all the matrices being added are the same. This point will become obvious once you do matrix addition yourself.
Suppose we have 2 matrices ‘A’ and ‘B’ and the resultant matrix after the addition is ‘C’. Then
Cij = Aij + Bij
For example, let’s take two matrices and solve them.
Observe that to get the elements of C matrix, I have added A and B element-wise i.e. 1 to 4, 3 to 5 and so on.
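As an illustration (with two hypothetical 2*2 matrices, since the article's original matrices are shown as images), matrix addition in NumPy is simply the + operator:

```python
import numpy as np

# Hypothetical matrices of the same order (2*2)
A = np.array([[1, 3],
              [2, 4]])
B = np.array([[4, 5],
              [6, 7]])
C = A + B          # element-wise: C[i][j] = A[i][j] + B[i][j]
print(C)           # [[ 5  8]
                   #  [ 8 11]]
```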
Scalar Multiplication – Multiplication of a matrix with a scalar constant is called scalar multiplication. All we have to do in a scalar multiplication is to multiply each element of the matrix with the given constant. Suppose we have a constant scalar ‘c’ and a matrix ‘A’. Then multiplying ‘c’ with ‘A’ gives-
c[Aij] = [c*Aij]
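A quick NumPy sketch of scalar multiplication (the matrix values are hypothetical):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
c = 3
print(c * A)       # every element is multiplied by c:
                   # [[ 3  6]
                   #  [ 9 12]]
```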
Transposition – Transposition simply means interchanging the row and column index. For example-
Transpose is used in vectorized implementation of linear and logistic regression.
Code in python
View the code on Gist.
Code in R
View the code on Gist.
Output
     [,1] [,2] [,3]
[1,]   11   12   13
[2,]   14   15   16
[3,]   17   18   19
t(A)
     [,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
Matrix multiplication is one of the most frequently used operations in linear algebra. We will learn to multiply two matrices as well as go through its important properties.
Before getting to the algorithm, there are a few points to keep in mind.
The multiplication of two matrices of orders i*j and j*k results in a matrix of order i*k. Just keep the outer indices to get the order of the final matrix.
Two matrices will be compatible for multiplication only if the number of columns of the first matrix and the number of rows of the second one are the same.
The third point is that the order of multiplication matters: in general, AB is not equal to BA.
Don’t worry if you can’t get these points. You will be able to understand by the end of this section.
Suppose, we are given two matrices A and B to multiply. I will write the final expression first and then will explain the steps.
I have picked this image from Wikipedia for your better understanding.
In the first illustration, we know that the order of the resulting matrix should be 3*3. So first of all, create a matrix of order 3*3. To determine (AB)ij, multiply each element of the ‘i’th row of A with the corresponding element of the ‘j’th column of B and add all the terms. To see this element-by-element computation, take a look at the code below.
import numpy as np
A = np.arange(21, 30).reshape(3, 3)
B = np.arange(31, 40).reshape(3, 3)
A.dot(B)
AB = array([[2250, 2316, 2382],
            [2556, 2631, 2706],
            [2862, 2946, 3030]])
B.dot(A)
BA = array([[2310, 2406, 2502],
            [2526, 2631, 2736],
            [2742, 2856, 2970]])
So, how did we get 2250 as the first element of the AB matrix? 2250 = 21*31 + 22*34 + 23*37. Similarly for the other elements.
Code in R
View the code on Gist.
A %*% B
     [,1] [,2] [,3]
[1,]  220  252  286
[2,]  322  360  400
[3,]  442  486  532
Notice the difference between AB and BA.
Properties of matrix multiplication
Matrix multiplication is associative provided the given matrices are compatible for multiplication i.e.
ABC = (AB)C = A(BC)
temp1 = (A.dot(B)).dot(C)
array([[306108, 313056, 320004],
       [347742, 355635, 363528],
       [389376, 398214, 407052]])
temp2 = A.dot(B.dot(C))
array([[306108, 313056, 320004],
       [347742, 355635, 363528],
       [389376, 398214, 407052]])
2. Matrix multiplication is not commutative i.e. AB and BA are not equal. We have verified this result above.
Matrix multiplication is used in linear and logistic regression when we calculate the value of the output variable by the parameterized vector method. As we have learned the basics of matrices, it’s time to apply them.
3.3 Representing equations in matrix form
Let me do something exciting for you. Take a pen and paper and try to find the value of the matrix multiplication shown below.
It can be verified very easily that the expression contains our three equations. We will name our matrices as ‘A’, ‘X’ and ‘Z’.
It explicitly verifies that we can write our equations together in one place as
AX = Z
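Since the article's actual coefficients are shown in an image, here is a sketch with a hypothetical 3-equation system written as AX = Z and handed to NumPy's linear solver:

```python
import numpy as np

# Hypothetical system (not the article's):
#   x +  y +  z =  6
#   x + 2y + 3z = 14
#   x + 4y + 9z = 36
A = np.array([[1, 1, 1],
              [1, 2, 3],
              [1, 4, 9]])
Z = np.array([6, 14, 36])

X = np.linalg.solve(A, Z)   # solves AX = Z
print(X)                    # [1. 2. 3.]
```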
The next step has to be the solution. We will go through two methods to find it.
4. Solving the Problem
Now, we will look in detail the two methods to solve matrix equations.
Row Echelon Form
Inverse of a Matrix
4.1 Row Echelon form
Now you have visualised what an equation in 3 variables represents and had a warm up on matrix operations. Let’s find the solution of the set of equations given to us to understand our first method of interest and explore it later in detail.
I have already illustrated that solving the equations by the substitution method can prove to be tedious and time-consuming. Our first method is a neater and more systematic way to accomplish the job, in which we manipulate our original equations systematically to find the solution. But what are those valid manipulations? Are there any qualifying criteria they have to fulfil? Well, yes. There are two conditions which have to be fulfilled by any manipulation to be valid.
Manipulation should preserve the solution i.e. solution should not be altered on imposing the manipulation.
Manipulation should be reversible.
So, what are those manipulations?
We can swap the order of equations.
We can multiply both sides of equations by any non-zero constant ‘c’.
We can multiply an equation by any non-zero constant and then add to other equation.
These points will become clearer once you go through the algorithm and practise it. The basic idea is to clear variables in successive equations and form an upper triangular matrix. Equipped with the prerequisites, let’s get started. But before that, it is strongly recommended to go through this link for a better understanding.
I will solve our original problem as an illustration. Let’s do it in steps.
Make an augmented matrix from the matrix ‘A’ and ‘Z’.
What I have done is I have just concatenated the two matrices. The augmented matrix simply tells that the elements in a row are coefficients of ‘x’, ‘y’ and ‘z’ and last element in the row is right-hand side of the equation.
Multiply row (1) with 2 and subtract from row (2). Similarly, multiply equation 1 with 5 and subtract from row (3).
In order to make an upper triangular matrix, multiply row (2) by 2 and then subtract from row (3).
Now that we have simplified our job, let’s retrieve the modified equations. We will start from the simplest, i.e. the one with the minimum number of remaining variables. If you follow the illustrated procedure, you will find that the last equation comes out to be the simplest one.
Now retrieve equation (2) and put the value of ‘z’ in it to find ‘y’. Do the same for equation (1).
Isn’t it pretty simple and clean?
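The procedure above can also be sketched in code. This is a minimal Gaussian-elimination sketch for a hypothetical system, assuming a unique solution and non-zero pivots (no row swaps):

```python
import numpy as np

def gaussian_elimination(A, Z):
    """Forward elimination to row echelon (upper triangular) form,
    then back substitution, starting from the simplest equation."""
    n = len(Z)
    # Augmented matrix [A | Z]
    M = np.hstack([A.astype(float), Z.reshape(-1, 1).astype(float)])
    for i in range(n):                       # clear entries below pivot i
        for j in range(i + 1, n):
            M[j] -= (M[j, i] / M[i, i]) * M[i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):           # back substitution, last row first
        x[i] = (M[i, n] - M[i, i + 1:n] @ x[i + 1:]) / M[i, i]
    return x

# Hypothetical system with solution x=1, y=2, z=3
A = np.array([[1, 1, 1], [1, 2, 3], [1, 4, 9]])
Z = np.array([6, 14, 36])
print(gaussian_elimination(A, Z))            # [1. 2. 3.]
```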
Let’s ponder over another point. Will we always be able to make an upper triangular matrix which gives a unique solution? Are there different cases possible? Recall that planes can intersect in multiple ways. Take your time to figure it out and then proceed further.
Different possible cases-
It’s possible that we get a unique solution as illustrated in above example. It indicates that all the three planes intersect in a point.
We can get a case like shown below
Note that the last equation, 0 = 0, is always true, but it seems like we have only 2 equations left. One of the equations is redundant. In many cases, it’s also possible that the number of redundant equations is more than one. In such a case, the number of solutions is infinite.
There is another case where Echelon matrix looks as shown below
Let’s retrieve the last equation.
Is it possible? The clear-cut intuition is NO. But does this signify something? It is analogous to saying that it is impossible to find a solution, and indeed it is true: we can’t find a solution for such a set of equations. Can you think of what is actually happening in terms of planes? Go back to the section where we saw planes intersecting and find out.
Note that this method is efficient for a set of 5-6 equations. Although the method is quite simple, if the equation set gets larger, the number of times you have to manipulate the equations becomes enormously high and the method becomes inefficient.
Rank of a matrix – Rank of a matrix is equal to the maximum number of linearly independent row vectors in a matrix.
A set of vectors is linearly dependent if we can express at least one of the vectors as a linear combination of remaining vectors in the set.
4.2 Inverse of a Matrix
For solving a large number of equations in one go, the inverse is used. Don’t panic if you are not familiar with the inverse. We will do a good amount of work on all the required concepts. Let’s start with a few terms and operations.
Determinant of a Matrix – The concept of determinant is applicable to square matrices only. I will lead you to the generalised expression of determinant in steps. To start with, let’s take a 2*2 matrix A.
For now, just focus on 2*2 matrix. The expression of determinant of the matrix A will be:
Note that det(A) is a standard notation for the determinant. Notice that all you have to do to find the determinant in this case is to multiply the diagonal elements together and put a positive or negative sign before each term. For determining the sign, sum the indices of a particular element. If the sum is an even number, put a positive sign before the multiplication, and if the sum is odd, put a negative sign. For example, the sum of the indices of element ‘a’ (i.e. a11) is 2, and the sum of the indices of element ‘d’ (i.e. a22) is 4, so we put a positive sign before the first term in the expression. Do the same thing for the second term yourself.
Now take a 3*3 matrix ‘B’ and find its determinant.
I am writing the expression first and then will explain the procedure step by step.
Each term consists of two parts: a coefficient and a submatrix. First of all, pick a coefficient. Observe that coefficients are picked from the first row only. To start with, I have picked the first element of the first row (you can start wherever you want). Once you have picked the coefficient, delete all the elements in the row and column corresponding to it. Next, make a matrix of the remaining elements, each in its original position, and find the determinant of this submatrix. Repeat the same procedure for each element in the first row. For determining the sign of each term, add the indices of the coefficient element: if the sum is even, put a positive sign, and if odd, put a negative sign. Finally, add all the terms to find the determinant. Now, let’s take a higher order matrix ‘C’ and generalise the concept.
Try to relate the expression to what we have done already and figure out the final expression.
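The cofactor-expansion procedure described above can be written as a short recursive function (fine for small matrices; the expansion grows exponentially with the matrix size):

```python
import numpy as np

def det_cofactor(M):
    """Determinant by cofactor expansion along the first row."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    total = 0.0
    for j in range(n):
        # Delete row 0 and column j to get the submatrix
        sub = np.delete(np.delete(M, 0, axis=0), j, axis=1)
        sign = (-1) ** j          # + for even column position, - for odd
        total += sign * M[0, j] * det_cofactor(sub)
    return total

print(det_cofactor([[1, 2], [3, 4]]))   # -2.0
```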
Code in python
import numpy as np
arr = np.arange(100, 116).reshape(4, 4)
array([[100, 101, 102, 103],
       [104, 105, 106, 107],
       [108, 109, 110, 111],
       [112, 113, 114, 115]])
np.linalg.det(arr)
# 0 up to floating point error: the rows are linearly dependent
Code in R
View the code on Gist.
            [,1]    [,2]        [,3]
[1,] -0.16208333 -0.1125  0.17458333
[2,] -0.07916667  0.1250 -0.04583333
[3,]  0.20791667 -0.0125 -0.09541667
#Determinant
-0.0004166667
Minor of a matrix
Let’s take a square matrix A. Then the minor corresponding to an element A(ij) is the determinant of the submatrix formed by deleting the ‘i’th row and ‘j’th column of the matrix. Hopefully you can relate this to what I explained in the determinant section. Let’s take an example.
To find the minor corresponding to element A11, delete first row and first column to find the submatrix.
Now find the determinant of this matrix as explained already. If you calculate the determinant of this matrix, you should get 4. If we denote minor by M11, then
M11 = 4
Similarly, you can do for other elements.
Cofactor of a matrix
In the above discussion of minors, if we consider signs of minor terms, the resultant we get is called cofactor of a matrix. To assign the sign, just sum the indices of the corresponding element. If it turns out to be even, assign positive sign. Else assign negative. Let’s take above illustration as an example. If we add the indices i.e. 1+1=2, so we should put a positive sign. Let’s say it C11. Then
C11 = 4
You should find cofactors corresponding to other elements by yourself for a good amount of practice.
Find the cofactor corresponding to each element. Now in the original matrix, replace the original element by the corresponding cofactor. The matrix thus found is called the cofactor matrix corresponding to the original matrix.
For example, let’s take our matrix A. if you have found out the cofactors corresponding to each element, just put them in a matrix according to rule stated above. If you have done it right, you should get cofactor matrix
Adjoint of a matrix – In our journey to find inverse, we are almost at the end. Just keep hold of the article for a couple of minutes and we will be there. So, next we will find the adjoint of a matrix.
Suppose we have to find the adjoint of a matrix A. We will do it in two steps.
In step 1, find the cofactor matrix of A.
In step 2, just transpose the cofactor matrix.
The resulting matrix is the adjoint of the original matrix. As an illustration, let’s find the adjoint of our matrix A. We already have the cofactor matrix C. The transpose of the cofactor matrix should be
Finally, in the next section, we will find the inverse.
4.2.1 Finding Inverse of a matrix
Do you remember the concept of the inverse of a number in elementary algebra? If there exist two numbers whose multiplication gives 1, those two numbers are called inverses of each other. Similarly, in linear algebra, if there exist two matrices whose multiplication yields an identity matrix, the matrices are called inverses of each other. If you cannot follow what I explained, just go along with the article; it will come to you intuitively. The best way to learn is by doing. So, let’s jump straight to the algorithm for finding the inverse of a matrix A. Again, we will do it in two steps.
Step 1: Find out the adjoint of the matrix A by the procedure explained in previous sections.
Step 2: Multiply the adjoint matrix by the inverse of the determinant of the matrix A. The resulting matrix is the inverse of A.
For example, let’s take our matrix A and find its inverse. We already have the adjoint matrix. The determinant of matrix A comes out to be -2. So, its inverse will be
Now suppose that the determinant comes out to be 0. What happens when we invert the determinant i.e. 0? Does it make any sense? It indicates clearly that we can’t find the inverse of such a matrix. Hence, this matrix is non-invertible. More technically, this type of matrix is called a singular matrix.
Keep in mind that the resultant of multiplication of a matrix and its inverse is an identity matrix. This property is going to be used extensively in equation solving.
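To make the two-step procedure concrete, here is a sketch (with a hypothetical matrix) that builds the cofactor matrix, transposes it into the adjoint, divides by the determinant, and checks that A multiplied by its inverse gives the identity matrix:

```python
import numpy as np

def inverse_via_adjoint(A):
    """Inverse = adjoint(A) / det(A). Assumes A is square and non-singular."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    C = np.zeros_like(A)                           # cofactor matrix
    for i in range(n):
        for j in range(n):
            sub = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(sub)
    adj = C.T                                      # adjoint = transpose of cofactors
    return adj / np.linalg.det(A)

A = np.array([[2.0, 1.0],
              [5.0, 3.0]])                         # hypothetical matrix, det = 1
A_inv = inverse_via_adjoint(A)
print(A @ A_inv)                                   # identity matrix (up to rounding)
```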
The inverse is used in finding the parameter vector corresponding to the minimum cost function in linear regression.
4.2.2 Power of matrices
What happens when we multiply a number by 1? Obviously, it remains the same. The same applies to an identity matrix: if we multiply a matrix with an identity matrix of the same order, it remains the same.
Let’s solve our original problem with the help of matrices. Our original problem represented in matrix form was as shown below
AX = Z i.e.
What happens when we pre-multiply both sides with the inverse of the coefficient matrix A? Let’s find out by doing it.
A-1 A X =A-1 Z
We can manipulate it as,
(A-1 A) X = A -1Z
But we know that multiplying a matrix with its inverse gives an identity matrix. So,
IX = A -1Z
Where I is the identity matrix of the corresponding order.
If you observe keenly, we have already reached the solution. Multiplying the identity matrix with X does not change it. So the equation becomes
X = A -1Z
For solving the equation, we have to just find the inverse. It can be very easily done by executing a few lines of codes. Isn’t it a really powerful method?
Code for inverse in python
import numpy as np
# Note: np.arange(5, 21).reshape(4, 4) has linearly dependent rows and is
# singular, so it has no inverse; an invertible matrix is used here instead.
arr1 = np.array([[2., 1., 0., 0.],
                 [1., 2., 1., 0.],
                 [0., 1., 2., 1.],
                 [0., 0., 1., 2.]])
np.linalg.inv(arr1)
4.2.3 Application of inverse in Data Science
The inverse is used to calculate the parameter vector by the normal equation in linear regression. Here is an illustration. Suppose we are given a data set as shown below-
Team League Year RS RA W OBP SLG BA G OOBP OSLG
ARI NL 2012 734 688 81 0.328 0.418 0.259 162 0.317 0.415
ATL NL 2012 700 600 94 0.32 0.389 0.247 162 0.306 0.378
BAL AL 2012 712 705 93 0.311 0.417 0.247 162 0.315 0.403
BOS AL 2012 734 806 69 0.315 0.415 0.26 162 0.331 0.428
CHC NL 2012 613 759 61 0.302 0.378 0.24 162 0.335 0.424
CHW AL 2012 748 676 85 0.318 0.422 0.255 162 0.319 0.405
CIN NL 2012 669 588 97 0.315 0.411 0.251 162 0.305 0.39
CLE AL 2012 667 845 68 0.324 0.381 0.251 162 0.336 0.43
COL NL 2012 758 890 64 0.33 0.436 0.274 162 0.357 0.47
DET AL 2012 726 670 88 0.335 0.422 0.268 162 0.314 0.402
HOU NL 2012 583 794 55 0.302 0.371 0.236 162 0.337 0.427
KCR AL 2012 676 746 72 0.317 0.4 0.265 162 0.339 0.423
LAA AL 2012 767 699 89 0.332 0.433 0.274 162 0.31 0.403
LAD NL 2012 637 597 86 0.317 0.374 0.252 162 0.31 0.364
It describes different variables of different baseball teams, used to predict whether a team makes it to the playoffs or not. But to make it a regression problem for now, suppose we are interested in predicting OOBP from the rest of the variables, so ‘OOBP’ is our target variable. To solve this problem using linear regression, we have to find the parameter vector. If you are familiar with the normal equation method, you should know that to do this, we need to make use of matrices. Let’s proceed further and denote our independent variables as the matrix ‘X’ below. This data is a part of a data set taken from analytics edge. Here is the link for the data set.
734 688 81 0.328 0.418 0.259
700 600 94 0.32 0.389 0.247
712 705 93 0.311 0.417 0.247
734 806 69 0.315 0.415 0.26
613 759 61 0.302 0.378 0.24
748 676 85 0.318 0.422 0.255
669 588 97 0.315 0.411 0.251
667 845 68 0.324 0.381 0.251
758 890 64 0.33 0.436 0.274
726 670 88 0.335 0.422 0.268
583 794 55 0.302 0.371 0.236
676 746 72 0.317 0.4 0.265
767 699 89 0.332 0.433 0.274
637 597 86 0.317 0.374 0.252
To find the final parameter vector (θ), assuming our initial function is parameterised by θ and X, all you have to do is find the inverse of (XT X), which can be accomplished very easily using the code shown below.
First of all, let me make the Linear Regression formulation easier for you to comprehend.
fθ(X) = θT X, where θ is the parameter vector we wish to calculate and X is the column vector of features or independent variables.
import numpy as np
import pandas as pd
# You don't need to bother about the following. It just transforms the data
# from the original source into a matrix. df is assumed to be a DataFrame
# loaded from the data set linked above.
df1 = df.head(14)
X = np.asmatrix(df1[['RS', 'RA', 'W', 'OBP', 'SLG', 'BA']])
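The snippet above stops at building X. As a self-contained sketch (on synthetic data, not the baseball set), the full normal-equation computation θ = (XT X)-1 XT y looks like this:

```python
import numpy as np

# Synthetic data: targets generated exactly as y = X @ [2, 3]
rng = np.random.default_rng(0)
X = rng.random((20, 2))
theta_true = np.array([2.0, 3.0])
y = X @ theta_true

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(theta)        # recovers [2. 3.]
```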
Imagine if you had to solve this set of equations without using linear algebra. Let me remind you that this data set is less than even 1% of the original data set. Now imagine finding the parameter vector without using linear algebra: it would have taken a lot of time and effort, and could sometimes even be impossible.
One major drawback of the normal equation method when the number of features is large is that it is computationally very costly. The reason is that if there are ‘n’ features, the matrix (XT X) is of order n*n and computing its inverse costs time of order O(n*n*n). Generally, the normal equation method is applied when the number of features is of the order of 1,000 or 10,000. Data sets with a larger number of features are handled with the help of another method called Gradient Descent.
5. Eigenvalues and Eigenvectors
Eigenvectors find a lot of applications in different domains like computer vision, physics and machine learning. If you have studied machine learning and are familiar with Principal component analysis algorithm, you must know how important the algorithm is when handling a large data set. Have you ever wondered what is going on behind that algorithm? Actually, the concept of Eigenvectors is the backbone of this algorithm. Let us explore Eigen vectors and Eigen values for a better understanding of it.
Let’s multiply a 2-dimensional vector with a 2*2 matrix and see what happens.
This operation on a vector is called linear transformation. Notice that the directions of input and output vectors are different. Note that the column matrix denotes a vector here.
I will illustrate my point with the help of a picture as shown below.
In the above picture, there are two types of vectors coloured in red and yellow and the picture is showing the change in vectors after a linear transformation. Note that on applying a linear transformation to yellow coloured vector, its direction changes but the direction of the red coloured vector doesn’t change even after applying the linear transformation. The vector coloured in red is an example of Eigenvector.
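A small NumPy check of this idea, using a hypothetical diagonal matrix whose Eigenvectors are easy to see:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])       # hypothetical transformation matrix
v = np.array([1.0, 0.0])         # an Eigenvector of A (Eigenvalue 2)
w = np.array([1.0, 1.0])         # not an Eigenvector

print(A @ v)   # [2. 0.] -> same direction as v, only scaled
print(A @ w)   # [2. 3.] -> the direction has changed
```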
Precisely, for a particular matrix, vectors whose direction remains unchanged even after applying a linear transformation with the matrix are called Eigenvectors of that particular matrix. Remember that the concept of Eigenvalues and Eigenvectors is applicable to square matrices only. Another thing to know is that I have taken the case of two-dimensional vectors, but the concept of Eigenvectors is applicable to a space of any number of dimensions.
5.1 How to find Eigenvectors of a matrix?
Suppose we have a matrix A and an Eigenvector ‘x’ corresponding to the matrix. As explained already, after multiplication with the matrix the direction of ‘x’ doesn’t change; only a change in magnitude is permitted. Let us write it as an equation-
Ax = cx
(A-cI)x = 0 …….(1)
Please note that ‘I’ denotes an identity matrix of the same order as A and ‘c’ is a scalar.
We have two unknowns ‘c’ and ‘x’ and only one equation. Can you think of a trick to solve this equation?
In equation (1), if we put the vector ‘x’ as the zero vector, it makes no sense. Hence, the only choice is that (A-cI) is a singular matrix, and a singular matrix has the property that its determinant equals 0. We will use this property to find the value of ‘c’.
det(A-cI) = 0
Once you find the determinant of the matrix (A-cI) and equate it to 0, you will get an equation in ‘c’ whose order depends upon the given matrix A. All you have to do is find the solutions of that equation. Suppose we find solutions ‘c1’, ‘c2’ and so on. Put ‘c1’ in equation (1) and find the vector ‘x1’ corresponding to ‘c1’. The vector ‘x1’ that you just found is an Eigenvector of A. Now repeat the same procedure with ‘c2’, ‘c3’ and so on.
Code for finding EigenVectors in python
import numpy as np
arr = np.arange(1, 10).reshape(3, 3)
np.linalg.eig(arr)    # returns the Eigenvalues and Eigenvectors
Code in R for finding Eigenvalues and Eigenvectors:
View the code on Gist.
Output
$values
147.737576   5.317459  -3.055035
$vectors
           [,1]       [,2]        [,3]
[1,] -0.3948374  0.4437557 -0.74478185
[2,] -0.5497457 -0.8199420 -0.06303763
[3,] -0.7361271  0.3616296  0.66432391
5.2 Use of Eigenvectors in Data Science
The concept of Eigenvectors is applied in the machine learning algorithm Principal Component Analysis (PCA). Suppose you have data with a large number of features, i.e. very high dimensionality. It is possible that there are redundant features in that data. Apart from this, a large number of features causes reduced efficiency and more disk space. What PCA does is scrap some of the less important features. But how do we determine those features? Here, Eigenvectors come to our rescue. Let’s go through the algorithm of PCA. Suppose we have ‘n’-dimensional data and we want to reduce it to ‘k’ dimensions. We will do it in steps.
Step 1: Data is mean normalised and feature scaled.
Step 2: We find out the covariance matrix of our data set.
Now we want to reduce the number of features, i.e. dimensions. But cutting off features means loss of information. We want to minimise the loss of information, i.e. we want to keep the maximum variance. So, we want to find out the directions in which the variance is maximum. We will find these directions in the next step.
Step 3: We find the Eigenvalues and Eigenvectors of the covariance matrix. The Eigenvectors corresponding to the largest Eigenvalues point in the directions of maximum variance.
Step 4: We will select ‘k’ Eigenvectors corresponding to the ‘k’ largest Eigenvalues and will form a matrix in which each Eigenvector will constitute a column. We will call this matrix as U.
Now it’s time to find the reduced data points. Suppose you want to reduce a data point ‘a’ in the data set to ‘k’ dimensions. To do so, you just have to transpose the matrix U and multiply it with the vector ‘a’. You will get the required vector in ‘k’ dimensions.
6. Singular Value Decomposition
Suppose you are given a feature matrix A. As suggested by the name, what we do is decompose the matrix A into three constituent matrices for a special purpose. Sometimes it is also said that SVD is a sort of generalisation of Eigenvalue decomposition. I will not go into its mathematics for the reason already explained and will stick to our plan, i.e. the use of SVD in data science.
SVD is used to remove the redundant features in a data set. Suppose you have a data set which comprises 1000 features. Any real data set with such a large number of features is bound to contain redundant ones. If you have run machine learning algorithms, you should be familiar with the fact that redundant features cause a lot of problems. Also, running an algorithm on the original data set will be time inefficient and will require a lot of memory. So, what should you do to handle such a problem? Do we have a choice? Can we omit some features? Will it lead to a significant amount of information loss? Will we be able to get an efficient enough algorithm even after omitting some features? I will answer these questions with the help of an illustration.
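As a sketch of the idea (with a random matrix standing in for real data), NumPy's SVD lets us keep only the top-k singular values and rebuild a lower-rank approximation:

```python
import numpy as np

# A random matrix stands in for the feature/intensity matrix here
rng = np.random.default_rng(0)
M = rng.random((50, 40))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 10                                         # keep the top-k singular values
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of M

print(M_k.shape)                               # (50, 40) - same shape, less information
print(np.linalg.matrix_rank(M_k))              # 10
```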
Look at the pictures shown below taken from this link
We can convert this tiger into black and white and can think of it as a matrix whose elements represent the pixel intensity as relevant location. In simpler words, the matrix contains information about the intensity of pixels of the image in the form of rows and columns. But, is it necessary to have all the columns in the intensity matrix? Will we be able to represent the tiger with a lesser amount of information? The next picture will clarify my point. In this picture, different images are shown corresponding to different ranks with different resolution. For now, just assume that higher rank implies the larger amount of information about pixel intensity. The image is taken from this link
It is clear that we can reach a pretty good image with 20 or 30 ranks instead of 100 or 200, and that’s what we want to do in the case of highly redundant data. What I want to convey is that to get a reasonable hypothesis, we don’t have to retain all the information present in the original dataset. Some features even cause problems in reaching the best algorithm: for example, the presence of redundant features causes multicollinearity in linear regression. Also, some features are not significant for our model, and omitting them helps to find a better fit along with time efficiency and less disk space. Singular value decomposition is used to get rid of the redundant features present in our data.
7. End notes
We had an amazing opportunity to learn from Mr. Pavan. He is an experienced data engineer with a passion for problem-solving and a drive for continuous growth. Throughout the conversation, Mr. Pavan shares his journey, inspirations, challenges, and accomplishments. Thus, providing valuable insights into the field of data engineering.
As we explore Mr. Pavan’s achievements, we discover his pride in developing reusable components, creating streamlined data pipelines, and winning a global hackathon. His passion for helping clients grow their businesses through data engineering shines through as he shares the impact of his work on their success. So, let’s delve into the world of data engineering and learn from the experiences and wisdom of Mr. Pavan.Let’s Get Started with the Interview! AV: Please introduce yourself and shed some light on your background.
Mr. Pavan: I started my academic journey as an Information Technology student at graduation. The promising job opportunities in the field primarily drive me. However, my entire perspective on programming shifted while participating in an MS hackathon called Yappon! I discovered a profound passion for it. This experience became a turning point in my life, igniting a spark to explore the programming world further.
Since then, I have actively participated in four hackathons, with the exhilarating result of winning three. These experiences have sharpened my technical skills and instilled a relentless desire to automate tasks and find efficient solutions. I thrive on the challenge of streamlining processes and eliminating repetitive tasks through automation.
On a personal level, I consider myself an ambivert, finding a balance between introversion and extroversion. However, I am constantly pushing myself to step out of my comfort zone and embrace new opportunities for growth and development. One of my passions outside of programming is trekking. There is something incredibly captivating about exploring the great outdoors and immersing myself in the beauty of nature.
My journey as a computer science enthusiast began with a pragmatic outlook on job prospects. Still, it transformed into an unwavering passion for programming through my participation in hackathons. With a track record of successful projects and a knack for automation, I am eager to continue expanding my skills and making a positive impact in the field of computer science.AV: Can you name a few people who have influenced your career, and how have they inspired you?
Mr. Pavan: First, I am grateful to my mother and grandmother. They instilled in me the values encapsulated in the Sanskrit quote, ‘Shatkarma Manushya yatnanam, saptakam daiva chintanam.’ Their belief in the importance of human effort and divine contemplation deeply resonated with me. This philosophy emphasizes the balance between personal endeavor and spiritual reflection and has been a guiding principle throughout my career. Their unwavering support and belief in me have been a constant source of inspiration.
In addition, I am fortunate to have a supportive network of friends who have played an integral role in my career journey. They have helped me understand complex programming concepts and motivated me to participate in hackathons and hone my skills. Their guidance and encouragement have been instrumental in pushing me beyond my limits and bringing out the best in me. I am immensely grateful for their presence in my life and for being an integral part of my progress thus far.

AV: What drew you to work with data? What do you find most exciting about your role as a data engineer?
Mr. Pavan: What drew me to work with data was the realization that data drives everything in today’s world. Data is the foundation upon which decisions are made, strategies are formulated, and innovations are born. I was captivated by the immense power that data holds in shaping the success of any industry or organization. The ability to transform raw data into meaningful insights and leverage those insights to drive positive outcomes for customers and businesses became a driving force behind my passion for working with data.
As a data engineer, what excites me the most is the opportunity to be at the forefront of the data revolution. I am fascinated by the intricate process of designing and implementing data systems that efficiently capture, process, and analyze massive volumes of information. Data’s sheer magnitude and complexity present exhilarating challenges that require creative problem-solving and continuous learning.

Must Have Skills for Data Engineers

AV: What are some of the most important technical skills a data engineer should possess? How have you developed these skills over time?
Mr. Pavan: Regarding technical skills, several key proficiencies are essential for a data engineer. Firstly, a strong foundation in SQL is vital, as it is the backbone of data manipulation and querying. Writing efficient and optimized SQL queries is crucial in extracting, transforming, and loading data from various sources.
Proficiency in at least one object-oriented programming language, such as Python, Scala, or Java, is also highly valuable for a data engineer. These languages enable the development of data pipelines, data integration workflows, and the implementation of data processing algorithms. Being adept in programming allows for more flexibility and control in working with large datasets and performing complex transformations.
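As a minimal sketch of the SQL-plus-Python combination described above, here is a toy extract-and-aggregate step. The table, column names, and data are invented for illustration; a real pipeline would read from an actual source system rather than an in-memory database.

```python
import sqlite3

# In-memory database standing in for a real source system (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# A simple extract-and-aggregate step: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

Pushing the aggregation into SQL, as here, is usually cheaper than pulling raw rows into Python and summing them there.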
A solid understanding of data warehousing concepts is important as well. This includes knowledge of data modeling techniques, dimensional modeling, and familiarity with different data warehousing architectures. Data engineering involves designing and building data structures that enable efficient data retrieval and analysis, and a strong grasp of these concepts is essential for success in this field.
Additionally, having a working knowledge of data lake concepts and distributed computing is becoming increasingly important in modern data engineering. Understanding how to store, manage, and process data in a distributed and scalable manner using technologies like Apache Hadoop and Apache Spark is highly beneficial. Distributed computing frameworks like Apache Spark allow for parallel processing of large-scale datasets and enable high-performance data processing and analytics.
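The map-and-reduce pattern that frameworks like Apache Spark parallelize can be sketched in plain Python. This is not Spark code, just a single-process illustration of the same map/shuffle/reduce shape, with invented input lines:

```python
from collections import defaultdict
from itertools import chain

lines = ["big data", "data engineering", "big pipelines"]

# Map: emit (word, 1) pairs per line -- the step a framework would run per partition.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Shuffle + reduce: group by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n
print(dict(counts))  # {'big': 2, 'data': 2, 'engineering': 1, 'pipelines': 1}
```

In a distributed setting, the map step runs independently on each partition of the data, and only the grouped (word, count) pairs travel over the network for the reduce step.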
In my journey as a data engineer, I have developed these technical skills over time through a combination of academic learning, practical experience, and a continuous drive for improvement. SQL and object-oriented programming languages were integral parts of my academic curriculum.

Problem Solving at its Core!

AV: How do you approach problem-solving as a data engineer? What methods have you found most effective?
Mr. Pavan: As a data engineer, problem-solving is at the core of my role. When approaching a problem, I believe that identifying the right problem to solve is crucial. Taking the time to clearly understand the problem statement, its context, and its underlying goals allows me to define the problem accurately and set a clear direction for finding a solution.
I often start by gathering information and conducting research to begin the problem-solving process. I explore relevant documentation, online resources, and community forums to gain insights into existing solutions, best practices, and potential approaches. Learning from the experiences and expertise of others in the field helps me broaden my understanding and consider various perspectives.
Once I have a good grasp of the problem and the available resources, I devise a solution approach. I break down the problem into smaller, manageable tasks or components, which enables me to tackle them more effectively. I prioritize tasks based on their importance, dependencies, and potential impact on the solution.
I maintain a mindset of continuous learning and improvement throughout the problem-solving process. I am open to exploring new technologies, techniques, and methodologies that can enhance my problem-solving capabilities.

Don’t Get Bogged Down by the Challenges

AV: What are some of the biggest challenges you face as a data engineer, and how do you overcome them?
Mr. Pavan: As a data engineer, there are several challenges that I have encountered in my role. Here are a few of the biggest challenges and how I have learned to overcome them:

Data Quality and Integrity
Ensuring the quality and integrity of data is crucial for accurate analysis and decision-making. However, working with diverse data sources and integrating data from various systems can lead to inconsistencies, missing values, and other data quality issues. To address this challenge, I employ robust data validation and cleansing techniques: I implement data validation checks, perform data profiling, and leverage data quality tools to identify and resolve anomalies. I also collaborate closely with data stakeholders and domain experts to understand the data and address quality concerns.

Scalability and Performance
Evolving Technology Landscape
Collaboration and Communication
Data engineering often involves collaborating with cross-functional teams, including data scientists, analysts, and stakeholders. Effective communication and collaboration can be challenging, particularly when dealing with complex technical concepts. To address this challenge, I focus on building strong relationships with team members, actively listening to their requirements, and conveying technical information clearly and concisely. Regular meetings and documentation also help facilitate collaboration and ensure everyone is aligned.

AV: Having worked as a Data Engineer for approximately 4 years, what accomplishments are you most proud of, and why?
Mr. Pavan: One of my significant achievements is developing reusable components that can be easily plugged and played using configuration files. This initiative has saved a significant amount of work hours for my team and the organization as a whole. By creating these reusable components, we can now quickly and efficiently implement common data engineering tasks, reducing repetitive work and increasing productivity.
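A configuration-driven design like the one described can be sketched as follows. The step names, registry, and config shape are invented for illustration; the point is that behavior is selected by a config file rather than by editing code:

```python
import json

# A hypothetical pipeline configuration -- in practice this would live in a file.
config = json.loads('{"steps": [{"op": "uppercase"}, {"op": "strip"}]}')

# Registry of reusable, pluggable transformations.
STEPS = {
    "uppercase": str.upper,
    "strip": str.strip,
}

def run_pipeline(value, cfg):
    """Apply each configured step to the value, in order."""
    for step in cfg["steps"]:
        value = STEPS[step["op"]](value)
    return value

print(run_pipeline("  hello  ", config))  # HELLO
```

Adding a new transformation then means registering one function and referencing it in the config, with no changes to the pipeline driver itself.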
I take pride in developing a data pipeline/framework that has streamlined the process of onboarding new data sources. This framework allows us to integrate new data sources into our existing data infrastructure seamlessly. It has reduced the time required for data source onboarding and ensured data accuracy and consistency throughout the pipeline. The ability to deploy this framework rapidly has been instrumental in accelerating data-driven insights and decision-making within the organization.
Participating in and winning a global hackathon has been a significant achievement in my career. It demonstrated my ability to work under pressure, think creatively, and collaborate effectively with team members. Winning the hackathon showcased my problem-solving skills, technical expertise, and ability to deliver innovative solutions within a constrained timeframe. It validated my capabilities and recognized my hard work and dedication to the project.
I am proud of the contributions I have made to help customers grow their businesses. In addition, I take pride in helping clients harness the power of data to drive their decision-making processes by delivering scalable, reliable, reusable, and performance- and cost-optimized solutions. By designing and implementing robust data engineering solutions, I have enabled businesses to leverage data effectively, derive actionable insights, and make informed strategic decisions. Witnessing my work’s positive impact on our customers’ success is incredibly rewarding and fuels my passion for data engineering.

Industry Trends
I seek online courses and training programs from reputable platforms like Coursera, edX, and Udacity. These courses cover many topics, including data engineering, cloud computing, distributed systems, and machine learning. By enrolling in these courses, I can learn from experienced instructors, gain hands-on experience with new tools and frameworks, and stay updated on the latest industry practices.
I actively engage in helping aspiring data engineers through an online learning platform. This involvement allows me to interact with individuals seeking to enter the data engineering field. By answering their questions, providing guidance, and sharing my knowledge, I contribute to their learning journey and gain insights into their challenges and concerns. This experience enables me to understand different perspectives, learn about new technologies or approaches they are exploring, and continuously expand my knowledge base.
I actively sought out learning opportunities both within and outside my workplace. This involved attending workshops, webinars, and conferences to stay updated on industry trends and technologies. I also enrolled in online courses to enhance my knowledge and skills in specific areas of interest.
I actively sought projects that stretched my abilities and allowed me to gain new experiences. I expanded my skill set by volunteering for challenging assignments, which also demonstrated my willingness to take the initiative and step beyond my comfort zone. These projects provided valuable learning opportunities and helped me add significant accomplishments to my resume.

Tips for Freshers Coming into Data Engineering
Having a growth mindset and a willingness to learn continuously is important. Stay curious and seek learning opportunities to expand your knowledge and stay ahead of industry trends. This can include taking online courses, attending webinars, reading industry blogs, and participating in relevant communities or forums.
Familiarize yourself with different data storage systems, data processing frameworks, data integration tools, and cloud computing. This includes technologies like Hadoop, Apache Spark, Apache Kafka, cloud platforms, and database management systems. Understanding the strengths and limitations of each component will help you design robust and efficient data pipelines.
Focus on developing proficiency in languages like Python, Scala, or Java, commonly used in data engineering tasks.
Theory alone is not sufficient in data engineering. Seek opportunities to work on real-world projects or internships where you can apply your knowledge and gain practical experience.
Engage with the data engineering community, join relevant forums or groups, and connect with professionals in the field.

Conclusion
From his initial foray into programming during a hackathon to his successful participation in multiple competitions, Mr. Pavan’s story is one of transformation and unwavering dedication. We hope his dedication, technical skills, and commitment to continuous learning inspire aspiring data professionals.
For those seeking additional career guidance, we recommend connecting with him on LinkedIn. A professional connection there can provide valuable insights and help you navigate your career path effectively.