Hadoop Professionals

A Community for Hadoop Users

Jason Venner

A little more detail on how line oriented FileSplits work

This block was written by Aaron Kimball of Cloudera and posted to core-user.



A FileSplit is merely a description of byte boundaries, e.g., "bytes 0 to
9999" and "bytes 10000 to 19999". The Mapper then interprets the boundaries
described by a FileSplit in a way that makes sense at the data level. The
FileSplit does not physically contain the data to be mapped over.
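
To make the "description only" point concrete, here is a minimal sketch in Python. The names Split and compute_splits are hypothetical and exist only for this illustration; this is not Hadoop's FileSplit class, and the real split computation in Hadoop takes block sizes and other settings into account. The sketch only assumes a file length and a fixed split size.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    """A description of a byte range; it holds no file data (hypothetical type)."""
    start: int   # offset of the first byte in the range
    length: int  # number of bytes in the range


def compute_splits(file_length: int, split_size: int) -> List[Split]:
    """Cut [0, file_length) into consecutive ranges of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    start = 0
    while start < file_length:
        length = min(split_size, file_length - start)
        splits.append(Split(start, length))
        start += length
    return splits


# A 20000-byte file cut into 10000-byte splits, as in the example above.
for s in compute_splits(20000, 10000):
    print(f"bytes {s.start} to {s.start + s.length - 1}")
```

Note that the boundaries are chosen without looking at the file contents, which is why a boundary can land in the middle of a record.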

So mapper 1 will open a file via the InputFormat and start reading at byte
0, and stop reading when it gets to its "final record," which is defined as
the first record that ends after byte 9999. If it has to read through
byte 10020, that's ok. The stream used to read the bytes from the file will
not "cut off" at 9999.

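Mapper 2 starts reading at byte 10000. It finds the first newline at byte
10020, so the first "real" record it processes starts at byte 10021. The
partial line in bytes 10000 through 10020 is discarded, because mapper 1 has
already read it as the tail of its final record.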
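
The behavior of the two mappers can be simulated with a short Python sketch. This is an illustration under stated assumptions, not Hadoop's actual record reader: the function name read_split_records is hypothetical, records are assumed to be terminated by a single newline byte, and the file is modeled as an in-memory byte string. The sketch also assumes that a reader whose split does not start at byte 0 always discards its first line, and that a record beginning exactly at the first byte past a split belongs to the earlier split.

```python
import io
from typing import Iterator, Tuple


def read_split_records(data: bytes, start: int, length: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, record) for each newline-terminated record owned by a split."""
    end = start + length        # first byte past the split
    stream = io.BytesIO(data)   # stands in for the file stream; it is not cut off at `end`
    stream.seek(start)
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline, since the previous reader consumes that record.
        pos += len(stream.readline())
    # Keep reading while the next record starts at or before `end`,
    # even if that record runs past the end of the split.
    while pos <= end:
        line = stream.readline()
        if not line:            # end of file
            break
        yield pos, line.rstrip(b"\n")
        pos += len(line)


# Five records in 31 bytes; the boundary at byte 16 falls inside "charlie".
data = b"alpha\nbravo\ncharlie\ndelta\necho\n"
print(list(read_split_records(data, 0, 16)))
print(list(read_split_records(data, 16, 15)))
```

In this run the first reader returns alpha, bravo, and charlie, reading past its boundary to finish charlie. The second reader skips the tail of charlie and returns delta and echo, so every record is processed exactly once.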


Mapper 2 looks for the first record start after byte 10000.


© 2012   Created by Jason Venner.
