What Is Data Engineering?

Data engineering is a fairly new term in IT, and it’s getting more and more attention. You may have heard of related fields such as data science, Big Data, and machine learning. This article explains the difference between these concepts and shows how they can be combined to analyze vast amounts of data.

In the early days of personal computing, storage capacity was very limited. Do you remember floppy disks? They were popular in the late 20th century, and the most common 3.5-inch type held 1.44 megabytes. You probably couldn’t fit a single modern photo file on such a disk, let alone a whole photo album.

Today, the only reminder of these disks is the "save" icon in Word and other programs, which looks like a floppy disk. The average personal computer can now store hundreds of gigabytes or even terabytes of data, thousands of times more than it could manage thirty years ago.

In the past, we didn't have to think much about organizing data. Nowadays, companies store such large amounts of information that they need to carefully plan how to organize and access it. This is where data engineering comes in – it's all about efficiently storing and handling vast amounts of data.

Data engineering teaches us how to create data processing pipelines, where to keep huge reservoirs of data, and how to maintain the infrastructure around all of them.

Data Engineering and Big Data

Data engineering is closely related to the concept of Big Data. Big Data refers to data sets so large that they cannot be processed in traditional ways. But what do we mean by "too large"?

In early 2020, Netflix had over 182 million streaming subscribers worldwide. Each of those subscribers picked a different set of videos to watch, streamed different shows, and stopped them at different times.

All of these events are valuable pieces of information that Netflix may want to store and process. Naturally, you can't just put data about 182 million subscribers in an Excel file. You need some more sophisticated tools – Big Data technologies.

Netflix’s Stranger Things is an excellent example of Big Data in action. It is a critically acclaimed TV series that was born from data analysis.

Some of the most popular big data technologies are:

  • Hadoop: A platform to store, process, and analyze massive amounts of data.
  • Spark: An engine for the distributed processing of large data sets, in batches or as streams.
  • Cassandra: A distributed NoSQL database management system designed to handle large amounts of data spread across many servers.

Big Data is a rapidly evolving field, so we can expect new tools to appear soon.

Data Engineering vs. Data Science, Artificial Intelligence, and Machine Learning

If you've heard about data engineering, you've probably also heard about data science, artificial intelligence (AI) and machine learning (ML). They are all different concepts, but they are often used together to get as much as possible out of data sets.

Data science is an area of science that combines statistics and programming to derive meaningful insights from data sets. However, before a data scientist can start analyzing data, these sets need to be prepared by a data engineer. The data engineer will typically set up a data storage solution (such as a database) and fill it with data from one or more sources (e.g. physical devices). Then the data scientist can start their analysis.
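To make this division of work more concrete, here is a minimal sketch of the data engineer's part, written in Python with the built-in sqlite3 module. Everything in it is an assumption made for illustration: the sensor readings, the reading table, and the clean and load functions are hypothetical, and a real pipeline would read from actual devices and write to a production database rather than an in-memory one.

```python
import sqlite3

# Hypothetical raw readings, as they might arrive from physical devices.
# Raw data often contains malformed values, so one is included on purpose.
RAW_READINGS = [
    {"device_id": "sensor-1", "temperature": "21.5"},
    {"device_id": "sensor-2", "temperature": "19.0"},
    {"device_id": "sensor-1", "temperature": "n/a"},  # malformed value
    {"device_id": "sensor-3", "temperature": "22.25"},
]


def clean(readings):
    """Keep only readings whose temperature parses as a number."""
    cleaned = []
    for row in readings:
        try:
            cleaned.append((row["device_id"], float(row["temperature"])))
        except ValueError:
            continue  # skip rows that cannot be converted
    return cleaned


def load(connection, rows):
    """Create the target table if needed and insert the cleaned rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS reading ("
        " device_id TEXT NOT NULL,"
        " temperature REAL NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO reading (device_id, temperature) VALUES (?, ?)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database for the demo
load(connection, clean(RAW_READINGS))

# From here on, the table is ready for a data scientist to analyze.
row_count = connection.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
print(row_count)
```

The cleaning and loading steps are kept in separate functions so that each can be tested or replaced on its own, which is a common way to structure even small pipelines.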

Artificial intelligence, or AI, is a branch of computer science focused on building "smart" machines. Such machines can perform tasks that originally required human intelligence. Scientists who deal with artificial intelligence aim to create devices that make "intelligent" decisions and mimic – or even surpass – human capabilities.

In most circumstances, large amounts of data are required to "teach" machines to behave in a human-like way; this is exactly where data engineering gets involved. Data engineers create data pipelines and data storage solutions that make AI experts' work easier.

One of the subdisciplines of AI is machine learning. It focuses on self-learning algorithms, i.e. algorithms that improve automatically as they process more training data. Once again, data engineering helps establish proper data flows and storage solutions, efficiently making the information available to machine learning algorithms.

As you can see, data engineering is not the same as data science, artificial intelligence, or machine learning. However, they are often used together. A data engineer can help collect and provide access to large sets of data for data scientists to analyze. Data engineers can also help AI experts "teach" their machines to behave like humans or give an ML algorithm the right training set.

Relational Databases and Data Engineering

The concept behind relational databases was proposed by Edgar F. Codd back in 1970. These databases have been in wide use ever since. They store information in tables, which are usually connected to one another through shared key columns.

We use SQL (Structured Query Language) to retrieve data from such databases. SQL was also invented in the 1970s and has been popular ever since. It is the lingua franca of the database world. No matter which database system you work with (e.g. Oracle, Microsoft SQL Server, or PostgreSQL), you can retrieve data using very similar or even identical SQL queries.
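As an illustration, the sketch below runs a simple query from Python using the built-in sqlite3 module. The subscriber table and its rows are made up for this example; SQLite is used only because it needs no installation. The query itself sticks to basic SQL (SELECT, GROUP BY, ORDER BY), so it should need little or no change on other relational databases.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical table and sample rows, created only for this example.
connection.execute(
    "CREATE TABLE subscriber ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " country TEXT NOT NULL)"
)
connection.executemany(
    "INSERT INTO subscriber (name, country) VALUES (?, ?)",
    [("Ann", "PL"), ("Bob", "US"), ("Cleo", "US"), ("Dan", "DE")],
)

# Count subscribers per country, largest groups first.
# Ties are broken alphabetically by country code.
QUERY = """
SELECT country, COUNT(*) AS subscribers
FROM subscriber
GROUP BY country
ORDER BY subscribers DESC, country
"""

rows = connection.execute(QUERY).fetchall()
for country, subscribers in rows:
    print(country, subscribers)
```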

As the amount of data in databases grew, some started claiming that the traditional tabular architecture was getting too complicated. Thus, NoSQL databases started gaining popularity. Among the most widely used NoSQL tools are MongoDB, Cassandra, and Redis. These tools store data differently than relational databases and they offer data retrieval mechanisms other than SQL.

The term "NoSQL databases" is a little misleading; many of these tools work with SQL, or with a language very close to it, in some way. For example, even though you can't directly use SQL to retrieve data in Hadoop, the Apache Hive project allows you to run SQL-like queries on datasets stored in Hadoop. Because of that, some people prefer to use the term "Not only SQL" instead of "NoSQL".

The bottom line is that relational databases use SQL, and many NoSQL tools let you use SQL or something very close to it. So, SQL is a very good starting point if you're thinking about a career in data engineering.

Learning SQL for Data Engineering

Most people only learn to write SELECT statements in SQL, since this is what you need to retrieve data from databases. This is also why most SQL courses focus on SELECT statements only.

However, for data engineering purposes, you should also learn to use SQL to manage the database structure. You need to know what tables in relational databases are, how to create them, and how to work with different data types. You should also understand how to add other features to a table – such as constraints, indexes, and views – that make working with databases easier.
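Here is a small sketch of what these structural features look like in practice, again using Python's built-in sqlite3 module. The customer and purchase tables, the index, and the customer_total view are all hypothetical names chosen for this example, and the exact syntax for data types and constraints varies somewhat between database systems.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical schema: two tables with constraints, one index, and one view.
connection.executescript("""
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);

CREATE TABLE purchase (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (id),
    amount      REAL NOT NULL CHECK (amount > 0)
);

-- The index speeds up lookups of purchases by customer.
CREATE INDEX idx_purchase_customer ON purchase (customer_id);

-- The view stores a query under a name, so it can be reused like a table.
CREATE VIEW customer_total AS
SELECT c.email, SUM(p.amount) AS total
FROM customer AS c
JOIN purchase AS p ON p.customer_id = c.id
GROUP BY c.email;
""")

connection.execute("INSERT INTO customer (id, email) VALUES (1, 'ann@example.com')")
connection.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 20.0)")
connection.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 5.5)")

# The CHECK constraint should reject a purchase with a negative amount.
try:
    connection.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, -3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

totals = connection.execute("SELECT email, total FROM customer_total").fetchall()
print(rejected, totals)
```

Constraints like NOT NULL, UNIQUE, and CHECK let the database itself guard data quality, so bad rows are stopped at the door instead of being discovered later during analysis.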

The LearnSQL.com platform has a dedicated Data Engineering Path that contains five important courses.

We understand that installing a database management system on your computer and setting it up properly can be difficult. That's why LearnSQL.com does all of that for you; all you need is a web browser with internet access. The platform prepares the database (and everything else) for you; you can focus on learning the core concepts of data engineering.

With LearnSQL.com, you also get all the basics explained in one place. Rather than trying to gather various articles, tutorials, and courses scattered around the web, you get a full learning experience with us. This new learning path is specifically tailored to the needs of future data engineers, so you don't need to spend time deciding what is important for this particular career path.

Is Data Engineering in Your Future?

Data engineering is a fairly new field in IT. It focuses on creating database structures and processing large amounts of data quickly. Data engineers typically use multiple Big Data technologies in their daily work. This is a very dynamic field, so we're interested to see how it develops in the future.