More than 4.57 billion people around the world now use the internet, close to 60% of the world's total population, and the number of users is growing by about 8% every year. With more users on the internet, and on social media in particular, the amount of data being used & stored has grown many times over. Most of this data needs to be stored persistently. Obviously, you don't want your data on Facebook or other social media to be removed without your consent. So, how are these tons of GBs or TBs of data managed & stored by giants like FB, Netflix, etc.? Let's see!
◉ Data becoming bigger every day:
Facebook needs to store & retrieve about 500+ TB of data & 2.5 billion pieces of content every day, and the situation is similar for other social media platforms like Twitter, Instagram, LinkedIn, etc. As data keeps growing at such a rapid pace & its security also needs to be maintained, some management strategy is required to store it.
This problem of rapidly growing data, which gives rise to other problems like Volume, I/O issues, etc., is known as BigData. BigData is not a technology; rather, it is a problem that needs to be solved with certain tools & technologies.
The term “BigData” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time.
Types of BigData:
- Structured: Highly organized information that can be readily stored in and retrieved from a database with simple queries. For instance, in an organization, employee details, job positions, salaries, etc. are generally kept in an organized manner.
- Semi-Structured: Data that has some structure but does not follow a strict, fixed schema. For instance, the data in log files, repositories, and word documents can be referred to as Semi-Structured Data.
- Unstructured: Data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data.
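To make the three types a bit more concrete, here is a small Python sketch. The sample records, names, and field names are all made up for illustration; the point is only how much work it takes to answer a question from each kind of data.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields
# (hypothetical employee table).
structured_csv = "name,position,salary\nAsha,Engineer,85000\nRavi,Analyst,60000\n"
employees = list(csv.DictReader(io.StringIO(structured_csv)))
total_salary = sum(int(row["salary"]) for row in employees)

# Semi-structured: each record describes itself with keys,
# but records do not have to share the same fields (hypothetical event log).
semi_structured = [
    '{"user": "asha", "action": "login"}',
    '{"user": "ravi", "action": "upload", "file": "cat.png", "size_kb": 512}',
]
events = [json.loads(line) for line in semi_structured]
uploads = [event for event in events if event.get("action") == "upload"]

# Unstructured: free text with no schema at all; even a simple
# question needs some form of text processing.
unstructured = "Hi team, the quarterly report is attached. Salary review is next week."
mentions_salary = "salary" in unstructured.lower()

print(total_salary)      # 145000
print(len(uploads))      # 1
print(mentions_salary)   # True
```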
The concept of BigData gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V's: Volume, Velocity, and Variety. A fourth V, Veracity, is often added to the list.
◉ Facts related to BigData:
- No. of users of some big social media platforms: Facebook 2.449 billion users, Instagram 1 billion, LinkedIn 660 million, Pinterest 265 million, Reddit 430 million monthly users, Snapchat 210 million daily users, & so on.
- Facebook receives about 500+TB of data & 2.5 billion pieces of content every day which needs to be stored permanently.
- When the Boeing 787 hit the market about a decade ago, it produced an average of 0.5 terabytes of data per flight.
- A single jet engine can generate 10+ terabytes of data in a half-hour of flight time. With many thousands of flights per day, the data generated reaches many petabytes.
- A very large number of posts, emails, and tweets are sent every second across various platforms, mostly social media.
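To get a feel for these numbers, here is some rough arithmetic in Python using the 500 TB per day figure quoted above for Facebook. The decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and the assumption that the rate stays constant all year are mine, so treat the results as ballpark figures only.

```python
TB_PER_DAY = 500                 # daily intake quoted above for Facebook
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365
GB_PER_TB = 1000                 # assumption: decimal units
TB_PER_PB = 1000                 # assumption: decimal units

# Average intake rate if the data arrived evenly through the day.
gb_per_second = TB_PER_DAY * GB_PER_TB / SECONDS_PER_DAY

# Total stored in a year if the daily rate never changed.
pb_per_year = TB_PER_DAY * DAYS_PER_YEAR / TB_PER_PB

print(f"{gb_per_second:.2f} GB per second")   # 5.79 GB per second
print(f"{pb_per_year:.1f} PB per year")       # 182.5 PB per year
```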
◉ Problems caused by BigData:
There are various problems caused by BigData but let's talk about two major ones: Volume (size) & Velocity (I/O).
- Volume (size): To save data we need storage devices, but companies don't know up front how much storage they will need, as the no. of users changes every day. If a company like FB simply bought one huge storage device up front, it might still be exhausted in a few months, & a single huge device also causes some more issues.
- Velocity (I/O): As the volume stored on a single device increases, the speed of I/O, i.e., reading & sending data, decreases. If FB saved all our data on one huge storage device, it could take a very long time to open or send our data like images, videos, etc. The lower the volume per device, the faster the I/O.
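The I/O side of the problem can be illustrated with rough arithmetic. The sketch below assumes a sequential read speed of 100 MB/s per device, which is a made-up round figure and not a measurement, and compares reading 1 TB from one device with reading the same data spread over ten devices at the same time.

```python
def read_time_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Time to read size_gb sequentially from one device (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_per_s


THROUGHPUT = 100.0   # MB/s per device, an assumed round figure
TOTAL_GB = 1000      # 1 TB of data
DEVICES = 10         # hypothetical number of smaller devices

# Everything on one device: the whole 1 TB is read through one pipe.
one_big_device = read_time_seconds(TOTAL_GB, THROUGHPUT)

# Spread over ten devices read in parallel: each device only reads its share.
ten_small_devices = read_time_seconds(TOTAL_GB / DEVICES, THROUGHPUT)

print(f"single device: {one_big_device / 3600:.2f} hours")             # 2.78 hours
print(f"ten devices in parallel: {ten_small_devices / 60:.1f} minutes")  # 16.7 minutes
```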
◉ Solution for BigData:
- To solve the above issues of Volume & Velocity, most companies like Facebook use a concept called Distributed Storage.
- Distributed Storage is just a concept, which provides the following benefits:
- The huge data is striped into parts, & each part is saved on a separate storage device. This improves I/O speed, as each storage device holds only a small volume.
- The striped parts are written to multiple storage devices in parallel, which increases the speed of storing the data.
- The issue of Volume (size) is solved, as we can add more storage devices whenever required. Hence, we don't need to worry about a huge volume of storage up front; it can be added as per requirements.
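The striping idea described above can be sketched in a few lines of Python. This is a toy model and not how any real system is implemented: `StorageDevice` is a hypothetical in-memory stand-in for a disk, the blocks are handed out round-robin, and the block size of 8 bytes is chosen only to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor


class StorageDevice:
    """Hypothetical in-memory stand-in for one storage device."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[int, bytes] = {}

    def write(self, index: int, block: bytes) -> None:
        self.blocks[index] = block


def stripe(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store(data: bytes, devices: list[StorageDevice], block_size: int) -> int:
    """Write blocks round-robin, one worker thread per device. Returns block count."""
    blocks = stripe(data, block_size)

    def write_share(device_index: int) -> None:
        # Device k receives blocks k, k + n, k + 2n, ... for n devices.
        for index in range(device_index, len(blocks), len(devices)):
            devices[device_index].write(index, blocks[index])

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        list(pool.map(write_share, range(len(devices))))
    return len(blocks)


def load(devices: list[StorageDevice], block_count: int) -> bytes:
    """Collect the blocks from all devices and join them in order."""
    merged: dict[int, bytes] = {}
    for device in devices:
        merged.update(device.blocks)
    return b"".join(merged[i] for i in range(block_count))


devices = [StorageDevice(f"device-{i}") for i in range(3)]
payload = b"0123456789" * 5                     # 50 bytes of sample data
count = store(payload, devices, block_size=8)   # 7 blocks over 3 devices

print(count)                                    # 7
print([len(d.blocks) for d in devices])         # [3, 2, 2]
print(load(devices, count) == payload)          # True
```

In this toy model, adding capacity is just a matter of appending another `StorageDevice` to the list before storing new data, which mirrors the point about adding storage as per requirements.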
Hadoop: It is a technology used by many companies like Facebook to solve the problem of BigData by applying the concept of Distributed Storage.
I created a flow chart demonstrating the Distributed Storage concept using Hadoop on a small scale. Here, each Data Node laptop represents a storage device. This is a Multi-Node Hadoop Cluster:
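As a rough text stand-in for such a cluster, the Python sketch below plans where the blocks of one file would go across a handful of data nodes. It is a simplification and not Hadoop's actual placement logic: the node names and the round-robin rule are made up, and the 128 MB block size and 3 copies per block are HDFS defaults that the post itself does not mention, so treat them as assumptions.

```python
import math

BLOCK_SIZE_MB = 128   # assumption: HDFS default block size
REPLICATION = 3       # assumption: HDFS default replication factor


def plan_blocks(file_size_mb, data_nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_id: [data nodes holding a copy]} using round-robin placement.

    Toy model only: real HDFS placement also considers racks and free space.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    block_count = math.ceil(file_size_mb / block_size_mb)
    plan = {}
    for block_id in range(block_count):
        start = block_id % len(data_nodes)
        # Pick `replication` consecutive nodes so copies land on distinct nodes.
        plan[block_id] = [
            data_nodes[(start + r) % len(data_nodes)] for r in range(replication)
        ]
    return plan


# Hypothetical four-laptop cluster storing one 1000 MB file.
nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
plan = plan_blocks(1000, nodes)

print(len(plan))   # 8
print(plan[0])     # ['datanode-1', 'datanode-2', 'datanode-3']
print(plan[3])     # ['datanode-4', 'datanode-1', 'datanode-2']
```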
◉ Hadoop Applications in the real-world:
- Security and Law Enforcement
- Enhance customer satisfaction and monitor online reputation
- Monitor patient vitals
- Healthcare Intelligence
- Track clickstream data
- Track geolocation data
- Track sensor data
- Strengthen security and compliance
Some major companies using Hadoop include Amazon AWS, Cloudera, IBM, Microsoft, Facebook, Hadapt, ScienceSoft, etc.
There are multiple uses of Hadoop in the BigData world, not just in Analytics. Hadoop is the key tool Facebook uses, not simply for analysis, but as an engine to power many features of the Facebook site, including messaging. As data is increasing day by day, its management & security also need to be enhanced. Most of the BigData issues can be solved using Hadoop. In the future, advances in technologies like Neuralink, founded by Elon Musk, may produce an enormous amount of data on a daily basis at the speed of thought, which will be a point of concern for us as Technical guys!
Thanks for reading! Meet you in another Blog or Project…