Photo (cc) by luckey_sun.
“You either look for data and you don’t know what to do with it; or you have an idea about what you want to do, [but] the data that you need to get isn’t available easily,” said Kevin Wall, a visual journalist at The Boston Globe.
Wall is just beginning to delve into machine learning on his own and to assess its potential use for The Boston Globe. “We need a lot of data for machine learning and deep learning, so it can be tough because you will need teams of people to get [that] amount of data,” he said.
Like Wall, the journalism industry is still scratching the surface of these cutting-edge data science tools. There are only a handful of projects out there – such as the search for hidden spy planes by BuzzFeed News, the analysis of the misclassification of serious assaults by the Los Angeles Times, and image recognition of members of Congress by The New York Times.
“For now, journalists and the media industry as a whole are recognizing that AI [artificial intelligence] and ML [machine learning] can benefit them, but it also represents this drastic shift from what otherwise has been a very stable industry for the last couple hundred years,” said Alex Siegman, an AI technical program manager at Dow Jones, a worldwide news and information powerhouse for business news and data. “This is something that’s still very new, and a lot of newsrooms are exploring what it means for them and how they can derive benefits from it.”
What is machine learning?
Simply put, machine learning is when a computer model is trained on a “teaching set” of data so that it can then identify patterns substantially faster and more effectively than a human being. For example, a model that is shown a large set of labeled cat and dog images as its “teaching set” can learn to distinguish between pictures of cats and dogs with high accuracy.
To summarize, machine learning is “finding patterns in large amounts of data and making predictions based on historical data,” said Siegman.
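To make the idea of a “teaching set” concrete, here is a minimal sketch in Python. It is not drawn from any of the projects described in this article: instead of images, it uses two invented measurements per animal (weight and height), and it uses one of the simplest possible methods, a nearest-centroid classifier, which averages the examples for each label and assigns a new example to the closest average. The names `TEACHING_SET`, `train` and `predict` are hypothetical.

```python
from statistics import mean

# Toy "teaching set": (weight in kg, height in cm) -> label.
# All values are invented for illustration.
TEACHING_SET = [
    ((4.0, 24.0), "cat"),
    ((5.0, 25.0), "cat"),
    ((3.5, 23.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((30.0, 60.0), "dog"),
    ((25.0, 58.0), "dog"),
]


def train(examples):
    """Learn one average point (centroid) per label from the labeled examples."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new example."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(TEACHING_SET)
print(predict(model, (4.5, 26.0)))   # a small animal the model has never seen
print(predict(model, (28.0, 57.0)))  # a large animal the model has never seen
```

Real image classifiers learn from millions of pixel values rather than two measurements, but the pattern is the same: learn from labeled history, then predict on new cases.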
There are two aspects to using machine learning in journalism: as part of investigative reporting, or as a day-to-day tool to make journalists’ lives easier.
Machine learning as part of investigative reporting
“There are probably relatively few circumstances under which reporters are going to need for one to acquire machine learning – it’s really where you’ve got a classification task,” said Peter Aldhous, a reporter on the science desk at BuzzFeed News.
Aldhous is the author of Hidden Spy Planes, a project in which he used machine learning to identify further covert spy planes. The project won the John S. Knight Journalism Fellowships award for innovation at the 2018 Data Journalism Awards, run by the Global Editors Network, an international association with the mission of fostering digital innovation in newsrooms all over the world.
Aldhous said his plane project was a rare case where machine learning was actually a good fit, because there was a large enough dataset to train the model. “I had very good data on these aircrafts, and a lot of it,” he said.
Aldhous acquired four months of flight data from more than 100 known government aircraft. From that, he was able to build a model that could flag planes that might have been surveillance aircraft.
But he warns that there is a danger of data reporters getting too excited about this shiny new tool. He said Rachel Shorey, a software engineer in the interactive news department of The New York Times, summarized this sentiment well at the National Institute for Computer-Assisted Reporting (NICAR) conference in March: sometimes simple things like a keyword alert or standard statistical sampling techniques might do just as good a job in even less time.
“We need to use the right tool for the right job,” said Aldhous. “[For much of what we do] we don’t need machine learning; we need good data reporting.”
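A keyword alert of the kind Shorey mentioned can be very short. The sketch below is an illustration only, not code from The New York Times or BuzzFeed News: the function name `keyword_alert`, the document IDs and the sample texts are all invented. It flags every document that mentions any word on a watch list, ignoring case and matching whole words only.

```python
import re


def keyword_alert(documents, keywords):
    """Return (doc_id, matched_keywords) for every document mentioning a keyword."""
    # Whole-word, case-insensitive pattern for each keyword on the watch list.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    alerts = []
    for doc_id, text in documents.items():
        hits = sorted(kw for kw, pattern in patterns.items() if pattern.search(text))
        if hits:
            alerts.append((doc_id, hits))
    return alerts


# Invented sample documents, for illustration only.
documents = {
    "filing-001": "The agency leased a surveillance aircraft last spring.",
    "filing-002": "Minutes of the parks committee meeting.",
    "filing-003": "Contract for aerial Surveillance and camera maintenance.",
}

alerts = keyword_alert(documents, ["surveillance", "camera"])
for doc_id, hits in alerts:
    print(doc_id, hits)
```

There is no training data and no model here, which is the point: when a fixed list of terms is enough to surface the documents a reporter cares about, nothing needs to be learned.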
Although newsrooms rarely need machine learning, Shorey described what actually happens when journalists do use it in their reporting. The process is “much more haphazard than is desirable,” Shorey wrote in an email. First, reporters find a good library in their favorite programming language; second, they read the documentation; third, they confirm that the methods are a good approach and that they understand the inputs and outputs (even if not all the underlying math); fourth, they spend days to weeks cleaning data; and last, they write about 10 lines of code to execute the machine learning process.
Machine learning as a day-to-day tool
“There’s a lot to what journalists have to do,” said Siegman, an AI technical program manager at Dow Jones. “If you can use technology or machine learning to automate or even semi-automate any part of that, that is a great benefit to journalists.”
Machine learning can help journalists with their day-to-day tasks, such as finding stories, doing photography and videography work, or editing and publishing their work on social media, he said. This can be done through small tasks such as automatically transcribing recordings, using image recognition to identify someone in a photo, and captioning videos; or through a larger task such as picking out useful information from a huge influx of content from sources such as social media.
Siegman thinks machine learning or artificial intelligence is nothing more than a tool. Ten or 20 years from now, he said, people will think about machine learning the way we think about Microsoft Excel today. “It’s [just] a tool that we are using to perform certain job functions,” he said.
The ethics
“I would not be happy, in journalism, using blackbox machine learning methods that I don’t know what they are doing,” said Aldhous.
Aldhous said transparency is crucial in journalism – reporters should be able to explain what they did. And at the same time, readers should be able to repeat what reporters did.
Algorithmic accountability is also vital. “One of the most important things journalists need to be doing is actually doing watchdog reporting on how machine learning algorithms are being used by companies and by government,” said Aldhous.
Aldhous thinks watchdog reporting around those issues is even more important than journalists using the algorithms themselves. He said there is a “potential for bias in any algorithmic decision.”
This can happen when a training set includes societal bias that machine learning picks up on, said Carlos Scheidegger, a computer scientist from the University of Arizona. “There’s very little you can do to validate your results if there’s a problem with the way that a classifier you are using worked,” he said.
Both Siegman and Aldhous cited the example of Amazon, which used a recruitment algorithm that turned out to be biased against women. The system was trained on resumes submitted over a 10-year period, mostly by male applicants. It then started penalizing resumes that included the word “women.”
“The bias precipitated through the algorithm, and into the real world,” said Siegman.
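The mechanism is easy to reproduce on a toy scale. The following Python sketch is not Amazon’s system and uses no real data: the resumes, the hiring decisions and the scoring method are all invented for illustration, and the names `HISTORY`, `learn_word_scores` and `score_resume` are hypothetical. Each word is scored by the share of past resumes containing it that led to a hire, so a word that appears only on rejected resumes drags down any new resume that includes it.

```python
from collections import Counter

# Invented historical decisions: (words on the resume, was the applicant hired?).
# The skew is deliberate: no past hire mentions the word "women".
HISTORY = [
    ({"python", "chess", "club"}, True),
    ({"java", "rugby", "captain"}, True),
    ({"python", "debate", "captain"}, True),
    ({"java", "chess", "club"}, True),
    ({"python", "women", "chess", "club"}, False),
    ({"java", "women", "debate", "captain"}, False),
    ({"python", "rugby", "club"}, False),
]


def learn_word_scores(history):
    """Score each word by the smoothed share of resumes containing it that were hired."""
    seen, hired = Counter(), Counter()
    for words, was_hired in history:
        for word in words:
            seen[word] += 1
            if was_hired:
                hired[word] += 1
    # Add-one smoothing keeps rare words away from exact 0 or 1.
    return {word: (hired[word] + 1) / (seen[word] + 2) for word in seen}


def score_resume(scores, words):
    """Average the learned scores of the known words on a resume."""
    known = [scores[w] for w in words if w in scores]
    return sum(known) / len(known) if known else 0.5


scores = learn_word_scores(HISTORY)
with_word = score_resume(scores, {"python", "chess", "club", "women"})
without_word = score_resume(scores, {"python", "chess", "club"})

# Two otherwise identical resumes; only one mentions "women".
print(round(with_word, 3), round(without_word, 3))
```

Nothing in the code mentions gender. The lower score for the first resume comes entirely from the skew in the historical decisions the model was given.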
Siegman thinks privacy concerns are also alarming. “To use any machine learning, you need lots and lots of data,” he said. “And there are privacy concerns around how you are collecting that data from users.”
The future of machine learning in journalism
Aldhous thinks machine learning has a future in journalism, but more on the publishing side – such as how to organize, distribute, share and display content to attract more readers. “But as time goes on, we will get a better idea of when it’s the right tool for the job, and when it just overkills or is not necessary,” he said.
Siegman agrees. “Don’t think about where we can use AI,” he said. “Think about what problems you are facing on a day-to-day [basis], and then evaluate whether or not AI might be a possible solution to that.”