Well, human voices typically have a base (fundamental) frequency somewhere between roughly 70 and 400 Hz, so you can look for spectral peaks in that range. If you want more accuracy than that, you can analyze the spectrum for formants, the resonance peaks of the vocal tract.
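As a rough sketch of the "look for spectral peaks there" idea, something like the following could work on a single short frame of audio. It assumes NumPy and mono samples. The function name `voice_band_peak`, the Hann window, and the energy-ratio measure are my own choices for illustration, not a standard recipe.

```python
import numpy as np

def voice_band_peak(samples, sample_rate, f_lo=70.0, f_hi=400.0):
    """Return (peak_frequency_hz, band_energy_ratio) for one audio frame.

    peak_frequency_hz is the strongest spectral bin inside [f_lo, f_hi];
    band_energy_ratio is the share of non-DC energy that falls in that band.
    Returns (None, 0.0) for silent frames or frames too short to resolve the band.
    """
    samples = np.asarray(samples, dtype=float)
    # A Hann window reduces leakage from the frame edges.
    windowed = samples * np.hanning(len(samples))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    band = (freqs >= f_lo) & (freqs <= f_hi)
    total = power[1:].sum()  # ignore the DC bin
    if total == 0 or not band.any():
        return None, 0.0

    # Zero out everything outside the band, then take the strongest bin.
    peak_idx = int(np.argmax(np.where(band, power, 0.0)))
    return float(freqs[peak_idx]), float(power[band].sum() / total)

# Example: a synthetic 150 Hz tone, 2048 samples at 16 kHz.
rate = 16000
t = np.arange(2048) / rate
peak_hz, ratio = voice_band_peak(np.sin(2 * np.pi * 150.0 * t), rate)
```

A frame with a clear peak in the band and a high energy ratio is a candidate for voiced speech. Any threshold you put on the ratio is something you would have to tune on your own recordings, and note that the frequency resolution is only `sample_rate / len(samples)` per bin.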

If the person isn't articulating - just going "ooooooo", "eeeeeeeee", or "ssssssss" - your job gets much harder.