Look, Ma, I’m a Data Scientist!
Tom Groenfeldt just told me what to start calling myself.
What’s a data scientist?
In “Big Data Needs Data Scientists, Or Quants, Or Excel Jockeys,” he quotes Randy Lea at Teradata’s Aster Center of Innovation, who “defines a Data Scientist as a person with mathematical and statistical skills, an investigative mind, an understanding of computer languages like C++ and Java,” and the ability to write code. Groenfeldt’s own definition includes “multi-skilled experts who understand programming, large-scale mathematics, statistics and business.” That’s me!
Why I claim to be enough of a data scientist to be useful (a.k.a. War Stories)
(Robert, I believe you—spare me!)
In graduate school, I crunched daily sounding-balloon observations around 10 years’ worth of typhoons and 27 years’ worth of hurricanes. I then built a bigger data set of winds from passenger jets and cloud motions in satellite loops and used that to study the exhaust plumes from hurricanes.
At the National Hurricane Center, I used multivariate analysis to build a statistical model to predict hurricane intensity, the first one to include environmental conditions as well as storm history. We had specially modified confidence limits on our F-tests to prevent spurious selection of predictors by our stepwise regression package (I’ll be glad to explain why that’s a problem in a way you’ll understand). The data set was small by modern standards but was pushing the limits of what we had to work with for storage: 6250 bpi 9-track tapes.
At the University of Wisconsin, I built and analyzed sets of weather satellite data for all sorts of things: winds from cloud motions on satellite loops, apparent temperatures of the cloud and background to estimate the altitude of the moving cloud (without which the wind measurement isn’t worth much), and the air temperatures in the centers of hurricanes. The software we had to put map overlays on satellite pictures broke down over the polar regions, so I rewrote it using vector algebra instead of trigonometry. (I’ll be glad to help you understand why that was the right solution, even though management rightly questioned me when I said we needed to “rewrite the whole thing.” You should never let a programmer do that!)
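For the curious, here is a minimal Python sketch of the general idea, not the code I actually wrote back then. It assumes a perfectly spherical Earth, and the function names `to_unit_vector` and `angular_separation_deg` are made up for illustration. Each latitude/longitude pair becomes a 3-D unit vector, and the angle between two points comes from the cross and dot products, so nothing special happens at the poles, where longitude stops meaning much.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3-D unit vector (spherical Earth assumed)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def angular_separation_deg(p, q):
    """Angle in degrees between two (lat, lon) points, computed from their unit vectors."""
    a = to_unit_vector(*p)
    b = to_unit_vector(*q)
    dot = sum(x * y for x, y in zip(a, b))
    cross = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    cross_norm = math.sqrt(sum(c * c for c in cross))
    # atan2 of |a x b| and a . b stays well behaved for both tiny and large angles.
    return math.degrees(math.atan2(cross_norm, dot))

# Two points 0.001 degrees from the North Pole, on opposite sides of it.
# Their longitudes differ by 180 degrees, yet they are close neighbors.
near_pole = angular_separation_deg((89.999, 0.0), (89.999, 180.0))
# Two points on the equator, a quarter of the way around the globe.
on_equator = angular_separation_deg((0.0, 0.0), (0.0, 90.0))
```

The design point is that the vectors never care which way "east" is, so the same arithmetic works at the pole as at the equator.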
The map outlines on our satellite pictures weren’t very good, so they asked me to redo them. I met my next Big Data (at the time), the Defense Mapping Agency’s Digital Chart of the World. It came on 4 CD-ROMs. We had a computer in the library with a CD reader on it. The problem of picking out the layers we wanted, especially just the “major” lakes and rivers, was too complex for me to keep straight and implement in C, so I asked my boss if I could learn C++ and program it in that (something else you should never do—let a programmer use a newly learned language on a major project!) My auto-feature extractors were not quite perfect, so I taught myself enough Java (the new new thing in 1995!) to build a little graphical editor to clean up our map overlays.
Lately, I’ve studied Bioinformatics, learned R, and used it and the Processing data visualization toolkit to visualize and understand xDSL broadband speeds.
Oh yeah, back at UW, I also figured out how to get an Excel spreadsheet to fit B-splines to gridded weather data. We needed to figure out how to get the boundary conditions right, and being able to visualize what was going wrong using Excel charts made all the difference.
On my last engagement, I really got my hands greasy with what I can only describe as Little Big Data: moving 100K customers out of QuickBooks and into Fishbowl Inventory, including a largely automated address and inactive-customer scrubber, built in Microsoft Access. Probably beneath the dignity of most “Data Scientists,” but at the time, I didn’t know I was one.
Now I do! I’m a “Data Scientist!”
…Or maybe I’m a “Data Dog?”
Until I learned about Data Scientists, I (informally) called myself a “Data Dog”—“Dig up the Good Stuff and roll in it!©” So I’m still not too proud to take on your QuickBooks and Excel mess. But Hadoop and NoSQL are on my to-learn list.
So if you need a Data Scientist to wrangle some Big Data (or Little Big Data) in Madison, WI, I just might be able to help.