Data lost after HDFS client was killed

This post covers a question about data apparently being lost when an HDFS client is killed, touching on Java, Linux, Hadoop, and HDFS.

I wrote a simple tool to upload logs to HDFS, and I noticed a curious phenomenon.

If I run the tool in the foreground and stop it with Ctrl-C, some data ends up in HDFS.

If I run the tool in the background and kill the process with "kill -KILL pid", the data that has already been processed is lost and an empty file is left in HDFS.


My tool syncs frequently (every 1000 lines) by invoking SequenceFile.Writer.syncFs().
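
For reference, here is a minimal sketch of that kind of upload loop, assuming the Hadoop 1.x API that syncFs() belongs to; the class name, key/value types, log source (stdin), and output path are placeholders, not details from the original tool:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LogUploader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(args[0]);   // e.g. the day's aggregate file

            // Old-style (Hadoop 1.x) writer, matching the syncFs() era.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, Text.class);

            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            long count = 0;
            try {
                while ((line = in.readLine()) != null) {
                    writer.append(new LongWritable(count), new Text(line));
                    if (++count % 1000 == 0) {
                        writer.syncFs();   // push buffered data out to HDFS
                    }
                }
            } finally {
                writer.close();   // only reached on a clean shutdown, never on SIGKILL
            }
        }
    }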

I just can't figure out why the data was lost. If my tool has been running all day and the machine suddenly crashes, will all of that day's data be lost?


My tool collects logs from different servers and uploads them to HDFS, aggregating all logs into a single file per day.

Solution:

You're really doing two fairly different tests there. Ctrl-C delivers SIGINT to your program, but "kill -KILL" sends SIGKILL. I would expect different results between them; for instance, POSIX states:

   The signals SIGKILL and SIGSTOP cannot be caught or ignored.

You could run strace to see the effect of your syncFs() call: does it actually result in a call to sync(), msync(), fsync(), fdatasync(), etc.? Also, consider a different implementation: can you close the file during periods of inactivity?
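
To illustrate the signal difference, here is a minimal sketch (not from the original answer) of a JVM shutdown hook that closes the writer. Shutdown hooks run when the process receives a catchable signal such as SIGINT (Ctrl-C) or SIGTERM, which is why some data survives that test, but they never run on SIGKILL, so anything buffered since the last syncFs() is lost. The registerCloseHook helper name is hypothetical.

    import org.apache.hadoop.io.SequenceFile;

    public class CleanShutdown {
        public static void registerCloseHook(final SequenceFile.Writer writer) {
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    try {
                        writer.close();   // flushes and completes the file on SIGINT/SIGTERM
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    }

Usage would be a single call right after the writer is created, e.g. CleanShutdown.registerCloseHook(writer).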
