Beware the Wikipedia Scrapers
Posted on November 4, 2011 Posted by John Scalzi 31 Comments
Over at Chaos Manor, Jerry Pournelle sounds an alarm about Hephaestus Books, which at first blush appears to be publishing his (and many other science fiction writers’) works without authorization, and cluttering up the search results of the online book stores with their wares.
I took a look at it; it seems that what Hephaestus Books is doing is that thing where unscrupulous jackasses scrape Wikipedia for articles on authors and books, prep files to give the appearance of a book, charge a ridiculous sum for the material and bang out either a POD object or an e-document when something is (almost always inadvertently) ordered. Sometimes the Wikipedia-derived nature of this crap is made apparent in the description text, but just as often it is not. If you’re not paying attention you can think you’re buying a collection of books when what’s really happening is that you’re being ripped off.
I first ran across this last year on Barnes & Noble’s site, and to that retailer’s credit, they subsequently changed their search parameters so that this sort of junk falls further down into the search queue. But it doesn’t ever go away, as you can see from this search; scroll down far enough and this crap is still there, not just from “Hephaestus Books” but also from “Fonte Wikipedia,” “Books LLC” and other such folks preying on your inattention. It’s not only Barnes & Noble where this stuff pops up; Jerry Pournelle found it on AbeBooks as well, and I’m sure there are other places it will appear.
Naturally none of us writers want you to get scammed by this crap; I don’t think any of us are any happier about it being out there than you are. Here are some clues to look for to avoid this junk:
1. Generic cover art;
2. Titles that are a list of authors, or of a particular author’s work (usually when we publish an omnibus collection of our works we’ll give it a unique name);
3. Publishers you’ve never heard of before — this is particularly the case with established authors like Jerry Pournelle, whose work is primarily with well-known imprints.
4. Small page counts — if you look at the page counts for this one, as an example, you’ll see it is but 50 pages long. That’s because Wikipedia articles usually aren’t very long at all.
5. Description data which notes the provenance of the material — although not every listing will have this.
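For readers who like to see a checklist written down as code, the clues above can be roughly sketched as follows. This is only an illustration: the `Listing` fields, the `warning_signs` function and the numeric thresholds are assumptions made up for the sketch, not anything a bookstore actually exposes, and generic cover art is left out because it can't be judged from text. The imprint names are the ones mentioned in this post and its comments.

```python
from dataclasses import dataclass

# Imprints named in this post and its comment thread.
KNOWN_SCRAPER_IMPRINTS = {
    "hephaestus books",
    "fonte wikipedia",
    "books llc",
    "betascript publishing",
}


@dataclass
class Listing:
    """Hypothetical record of what a store page shows for one book."""
    title: str
    publisher: str
    pages: int
    description: str


def warning_signs(listing, familiar_publishers=frozenset()):
    """Return the clues from the list above that this listing trips."""
    signs = []

    publisher = listing.publisher.strip().lower()
    if publisher in KNOWN_SCRAPER_IMPRINTS:
        signs.append("publisher is a known scraper imprint")
    elif publisher not in {p.lower() for p in familiar_publishers}:
        signs.append("publisher you have never heard of")

    # Assumed rule of thumb: several commas, or the word "including",
    # suggests the title is a list of authors or works.
    title = listing.title.lower()
    if listing.title.count(",") >= 3 or "including" in title:
        signs.append("title reads like a list of authors or works")

    # Assumed threshold: Wikipedia printouts tend to be very short.
    if listing.pages <= 60:
        signs.append("small page count")

    if "wikipedia" in listing.description.lower():
        signs.append("description mentions Wikipedia")

    return signs
```

No single clue is proof of anything; the more of them a listing trips, the more carefully it deserves to be read before buying.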
Regardless, it certainly appears that these people are hoping to get you to buy something other than what you think you are. So caveat emptor, my friends. Pay attention before you click the “buy” button.
This stuff also pops up on Amazon and Alibris, among others. It hits everything Wikipedia covers, so any subject you can think of. It’s something to stay aware of if purchasing textbooks online, too; a classmate at my graduate school once purchased a poorly printed Wikipedia article instead of an assigned text, and even paid for overnight shipping.
This is one of the things that has tipped me over toward launching a self-publishing venture of my own; right now most of what you can find by googling me electronically is deceptive/name-fakes and pirates. Honestly, it makes me like the pirates so much better; at least the reader gets my book.
This is a peril with actual public domain works, as well. Hey, I am willing to spend money for a nice version of, let’s say, Edgar Rice Burroughs’s “John Carter of Mars” books. The trick is that sometimes it is hard to navigate through the glut of badly formatted and almost unreadable POD stuff.
Like we didn’t have enough problems with spam?
I seem to recall seeing “ebooks” for sale that were in the public domain or lifted from Project Gutenberg; Little Fuzzy and some of H. Beam Piper’s other works spring to mind. As always, buyer beware.
C.J. Cherryh on Monday published the same gripe regarding Hephaestus at Barnes and Noble, and basically asked her readers to contact B&N and raise hell.
I honestly don’t understand why they’re not guilty of fraud. I assume they _are_ guilty of some sort of false advertising thing, but it’s not been worth anyone’s time to stamp it out. OK, a literal reading of the description bears some relationship to the content, but it seems pretty blatant they’re not actually trying to sell that; they’re putting those words where the only reasonable interpretation of them is something else they hope to make money off.
Can someone explain this issue to me? Is it for example that someone is taking an article on Wikipedia *on* John Scalzi, printing it, and selling it to someone as if it were a book *by* John Scalzi? I’ve never heard of this, so I’m a little behind the curve.
@MarkH – for the money. And because people suck. This is Bait and Switch. You think you are getting Old Man’s War, but what you are really getting is a Wikipedia article about Old Man’s War.
It’s not fair to anyone. Except the lowlife who’s perpetrating the con.
MarkH – that is exactly what they do. It’s not just science fiction authors. I first ran across this in technical books, where they would advertise a fairly expensive book on “OpenGL” or “Computer Graphics”. I got interested by a new book, and looked into it…and it is literally a printout of a whole bunch of Wikipedia articles on these technical topics.
I don’t know who would fall for this, but I suspect that libraries and universities, and corporate buyers might. They have a budget to buy certain types of books, so if they do not carefully check first, they might accidentally buy a copy.
As prez of SFWA, do you have flying monkey lawyers you can send out of your tower to chase down people like this? Because that would be pretty cool.
I don’t generally talk of my SFWA flying monkeys here. Briefly put, however, this is in our remit.
Betascript Publishing is the worst one- they’re identical in manner to these, but worse due to their subject choices. Check this book: “Warm Worlds and Otherwise: Alice Sheldon, James Tiptree, Jr, Robert Silverberg, The Girl Who Was Plugged In, Hugo Award” (ISBN 9786133549050). So, articles on three authors, article on a short story by one of them, and the article on the Hugo Award. Feels almost like an automated script just pulling articles that are linked to each other, rather than any intelligent agent selecting like things.
I’m an indie author that sells about 10-20 books a month, nothing amazing, but I realized this was happening to my book about 3 months ago. I noticed it on Amazon and then Barnes and Noble. I looked into it and came to the same conclusion you did, but I was shocked that companies such as Amazon or Barnes and Noble do not have a way to regulate this. I imagine these people have a lot of time on their hands to just add every book they see, and I’m sure they scam enough people out of their money to make a decent profit. It’s pretty sad that people would do this.
Oh and maybe you could loan me one of those flying monkey lawyers if I ever needed it.
Now I imagine there is a guy with the “football” who follows Scalzi around. It’s a briefcase that contains keys to the cages of the flying monkeys.
I’m guessing that ever since Google came out with the Panda update that dropped the useless spamblogs to the bottom of the search results, these bottom-feeders had to change tactics, so instead of tricking you into clicking on a link for the ad revenue, they trick you into buying their recycled ‘content’ in print or ebook form.
CJ Cherryh has run afoul of these peckerheads several times before, and just put the word out on the street to her readers about the latest incarnation. I am glad to see that the science fiction community is becoming more aware of this scam, especially the writers with deeper pockets, like Mr. Pournelle and Mr. Ellison. I would like to see Harlan Ellison particularly going after these jerks in his own inimitable style; I would pay good money to see a copy of his cease and desist order. Unfortunately, it’s like playing electronic whack-a-mole with the scrapers, as they seem to have an infinite number of aliases for this business model.
These people are a problem. There isn’t much we can do about them while the content is under a free licence – we do this Wikipedia thing in order to spread it far and wide, after all – it’s hard enough getting across the idea of freely reusable content as it is. Wikipedia editors get particularly annoyed when they hit these things, think “woohoo, another print source to use!” and discover they’ve paid $50 for … their own work.
Our viewpoint is “reuse our stuff, that’s what it’s there for – but make it damn clear what it is.”
The problem will largely solve itself: if physical copies of Wikipedia articles ever gained any actual popularity, competition would kick in very fast. Even competition on quality would fail, as people sought to design beautiful editions just because they could.
In the meantime, please do get the word out about these things :-)
BTW – the casual reader encountering these things may not be aware of the business model. These are print-on-demand books, compiled by computer from a list of keywords. *No* copies exist until someone orders one, at which point a single copy is printed and sent. People aren’t generally aware that POD is very good quality these days — you can send a PDF to a machine and have it spit out an absolutely beautiful perfect-bound book for you, of a standard which previously would have been quite pricey. So these people manage to eke out a tiny profit on single copies, having worked out a way to spam Amazon.
Seconding what David Gerard says – as a Wikipedian, this is considered highly irritating within the community, but there’s a general consensus there (and at the Wikimedia Foundation, including its attorneys, as I recall) that it’s just scummy, not illegal from our side of things.
The books do apparently generally contain legally compliant source information and license information.
From the customer side of things, and the authors who are being drawn in by the impersonation / appropriation of identity aspects, there are better cases that fraud of some sort is going on.
Yeah, I already ran into these guys–fortunately not by buying their work, but while indexing my collection to be transferred to ereader. You see, we culled a lot of books, and I would go through every pile and see if it had an ebook version. So I ran into Hephaestus Books while trying to see if I could complete my Anne McCaffreys, because a surprising number of her books aren’t available as ebooks….except in a collection from Hephaestus. So I looked into it and discovered it wasn’t legit pretty quickly. They and their ilk made it even harder to determine if the book was legitimately available or not. Sigh.
Coincidentally, a few days ago I found a bunch of these books when I was searching for a particular title that is going to be released next year – “Redshirts”.
John had warned us specifically about the LLC group some months back, which was interesting timing since I’d gotten (from Amazon) and read OMW and was working my way through the rest of the set via their “save for later” shopping-cart feature.
It wasn’t long afterwards that the LLC Scalzi one turned up on my recommendations list, naturally enough …. so I posted a “review” of it and of LLC generally based on John’s comments and entitled **BEWARE**. Maybe that actually got the attention of someone at Amazon (Canada), since I haven’t lately seen anything there from LLC even by searching.
I’ve encountered what could almost be described as the flip side of this. I work in a University and a colleague had requested help in creating a chapter for an open source book publication which operated on a pay to publish system. We agreed and started to help out.
It was only when I had some spare time and researched the company involved that I found out it was essentially a scam. The books/journals they produce are essentially worthless with no peer review and absolutely no value beyond vanity publishing. All of their profit is from unwitting people coughing up the money for the publication. Vanity publication for the scientific market, with a veneer of respectability.
It has been suggested that some sort of legal protection for titles would help with this (you can trademark a series name or imprint, but you can’t copyright a title). I don’t like the idea much as I’d hate to have to change the titles of my next three novels, Gone with the Wind, From Here To Eternity, and Rocket Ship Galileo ….
(slightly more seriously, multiple titles are a cat so far out of the bag — there have been more than 100 commercial novels named Bloodline, more than 100 named Legacy, etc. — that you’d never be able to do anything fair or effective about it, so no, I don’t support the idea).
Okay, this one is too much fun. Somebody running this scam got me and one of my books muddled up with a hip-hop record released by one of the greatest soccer players of all time (the British, too, have a tradition of turning athletes loose on the unsuspecting music and film worlds). Uh, that’s not me in the picture, but that’s my book title and publication date. And the description below is everybody ever known to Wikipedia with my name, which is about 15 people. Apparently we all teamed up to produce an album with the tracks out of order.
In case it helps, I did some research on this and wrote two short articles, which are here and here.
As a brand new Canadian publisher, who will be publishing Canadian SF&F, I’m damned annoyed at these jerks. The good thing though is that Smashwords is clean, and the Apple and Sony stores appear to be clean too.
One possible out is to educate people (primarily on Amazon and B&N) to ONLY buy books with the “Search Inside” or “Look Inside” capability, where the publisher lets you take a limited look at the pages in the book. You can immediately separate out the junk from the quality that way.
Terry Kepner – FlyingChipmunkPublishing . com
Some of these trash books do, remarkably, allow “Search Inside”. And what you find there is not a pretty sight.
In my post “The Trouble with Amazon” I stated:
6. Amazon enables trash publishers
I’ve written to Amazon, as have others, to ask why, for example, it offers for sale 421,014 titles from Kessinger Publishing, one of several companies accused of copyfraud. Or why it offers George Andersen’s classic work, Steve Jobs, purloined, if that is not too harsh a term, directly from Wikipedia, and available for $1.99, of which I assume Amazon earns 59 cents.
We have to just keep the pressure on Amazon and B&N and others. It’s completely shameful.
Hi Everybody, I know it’s an older thread but I have just found out about these scams and I wanted to say what it looks like to me as an IT specialist.
I think they use an automated process that goes to Wikipedia, gets a list of categories, and for each category parses the list of pages in that category and builds a PDF out of it.
For example one of the books I have seen is “People from Lawton, Oklahoma. Including …”
They just got the category from http://en.wikipedia.org/wiki/Category:People_from_Lawton,_Oklahoma
picked some random pages from the list (or maybe the ones with more content than others) and automatically built a PDF that they (probably also automatically) submit to Amazon and others that have an API to manage your account. And then they have a book: http://www.amazon.co.uk/People-Lawton-Oklahoma-Including-Journalist/dp/1243829842 .
This is all basic stuff if you know what you’re doing and could be done in a weekend, maybe some more to tweak it properly to have a better page design. On top of that it can all be done using free software tools.
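Going the other way is just as easy, which is handy if you want to check a suspicious listing. Here is a minimal sketch, assuming the title follows the “<category name>. Including ...” pattern of the example above; the function name `guess_category_url` is made up for illustration, and titles that don’t follow that pattern will simply produce a URL that leads nowhere.

```python
from urllib.parse import quote


def guess_category_url(book_title):
    """Guess the Wikipedia category a scraped 'book' was built from.

    Assumes the title looks like '<category name>. Including ...',
    as in the Lawton, Oklahoma example.
    """
    # Keep only the part before ". Including".
    head = book_title.split(". Including", 1)[0].strip()
    # Wikipedia page names use underscores in place of spaces.
    page = "Category:" + head.replace(" ", "_")
    # Percent-encode anything unusual, but leave common title
    # punctuation readable.
    return "http://en.wikipedia.org/wiki/" + quote(page, safe=":,_()")


print(guess_category_url("People from Lawton, Oklahoma. Including ..."))
```

If the URL opens a real category page whose entries match the names in the book’s title, you can be fairly sure what you are looking at.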
The only illegal thing is probably the deceptive title (although they explain that in the description, at least on Amazon), as people mentioned above, but I am really annoyed when scum like that want to earn money without bringing anything of value to the table. I would never buy those once I had noticed the description, but I started looking into it when I saw there are hundreds of thousands of books like that on Amazon alone (sold by several publishers).
Even if it is not illegal, there should be something called “consumer protection” that would force them to make it clearer what the product is. That’s it.