Usage scenarios for text mining are well known.
The attached tool and its concepts are grounded in what the business needs and, even though the executable demo is an approximation, in what is practically achievable.
[Perhaps I should first note that in my last 12 years of working with business apps, I have found that apps which use some form of web scraping or text extraction always return value for money. They win because the business wants real-time pieces of semantic text. Read the high-level business story.]
There are also some limitations to be aware of in any text extraction scenario.
- The targeted "set of documents" cannot be the whole World Wide Web, so the document sets or URLs specific to the business have to be known in advance.
- Efficiency cannot be 100%. In the attached tool efficiency can be increased, but much depends on how clearly the desired text stands out. It is a foreground-background problem.
- Just-in-time, on-the-fly searches are not really an option. It has to be known beforehand what the users will habitually and recurrently look for in a set of documents. This problem does have a solution, which is ontology-dependent.
List of usages.
- Data collection automation from raw sources
  - financial, people, projects, any entity (web or intranet)
  - desirables-only filter
  - content-change-based event capture
  - data warehouse side feed
- Clutter identification and removal
  - unwanted ads browser/email plug-in
  - personal anti-spam plug-in
  - irrelevant resumes
  - summary for the CEO (among a zillion intranet docs)
  - precise and short document thumbnails
- Identification of subversive communication patterns
- Automated web-service feed for mash-ups based on relevant web docs (see the extreme example below)
- Specialized business feeds for a Silverlight or Google Gears "lite" database
- CRM: profile thy customer as thou knowest best from the first 20 Google pages
- CRM: target your ads based on your own custom profile creation (this puts some of these newfangled mobile ad engines out of business, since all they would need to supply is perhaps the raw Facebook page, direct to the ad agency)
- Build your own database in 30 days from web sources, including history
- Make an ESB that does intelligent alerting based on web sources
- Many, many more, so please add the rest...
Footnote: the usage I am personally most interested in is called "putting the toothpaste back into the tube". Make of that what you will :-)
More usages, by popular demand...
The more "paperless" companies try to be, the more paper-littered they become. Why?
Because there is hardly any substitute for the human eyes scanning a document.
But not always.
If you have to deal with hundreds or thousands of documents, and they have sections of interest that are more or less similar, you can let a machine do the work. Here are some more usages.
Get news and summaries of specific interest to you from web publications, tenders, Google searches and so on.
Filter your RSS feeds and blogs, out or in.
Document segregation: large companies dealing with many projects also have to deal with zealous project managers sending out zillions of project reports which no one really wants to read, but everyone has to. How about a tool that extracts or builds up a summary, or notifies you when a specific context appears?
Let's say a chemical process industry project produces numerous documents, of which only 10% have anything to do with process control instrumentation, and the short-handed instrumentation department wants to read only those where certain specific things appear. (Feasible, though a nice front-end would be mandatory in this case.)
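As a rough illustration of the instrumentation example, here is a minimal Python sketch of that kind of triage. It is not the attached tool: the watch terms, the function names `find_hits` and `triage`, and the sample documents are all assumptions made up for this example, and plain phrase matching is only the crudest form of "a specific context appears".

```python
import re

# Hypothetical watch list for an instrumentation department; the terms are
# illustrative assumptions, not taken from any real project.
WATCH_TERMS = ["control valve", "pressure transmitter", "PLC", "loop diagram"]


def find_hits(text, terms=WATCH_TERMS):
    """Return the watch terms that occur in text as whole words or phrases."""
    hits = []
    for term in terms:
        # \b keeps "PLC" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


def triage(documents, min_hits=1):
    """Map each document name to its hits, keeping only documents worth reading."""
    selected = {}
    for name, text in documents.items():
        hits = find_hits(text)
        if len(hits) >= min_hits:
            selected[name] = hits
    return selected


if __name__ == "__main__":
    docs = {
        "weekly_report.txt": "Piping isometrics were issued. Civil work is on schedule.",
        "io_review.txt": "The PLC cabinet layout changed; one control valve was resized.",
    }
    for name, hits in triage(docs).items():
        print(name, "->", ", ".join(hits))
```

Raising `min_hits` trades recall for precision, which is one simple way to approach the foreground-background problem mentioned earlier.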
E-Learning: think of teachers on e-learning platforms who can only create quizzes in objective one-line formats, with only 4-5 formats to choose from. Now they could specify a full-text format instead because, unless the subject is literature, a tool can be set up to roughly compare an ideal answer with the received answers.
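To make "roughly compare" concrete, here is a small Python sketch. It assumes a simple technique, Jaccard similarity over key words, which is my choice for illustration and not necessarily what the attached tool does. The stop-word list, the function names and the sample answers are hypothetical.

```python
import re

# A tiny illustrative stop-word list; a real deployment would need a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "by"}


def key_words(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def answer_score(ideal, received):
    """Jaccard similarity between the key words of two answers (0.0 to 1.0)."""
    a, b = key_words(ideal), key_words(received)
    if not a and not b:
        return 1.0  # two empty answers are trivially identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    ideal = "Photosynthesis converts light energy into chemical energy in plants."
    answers = [
        "Plants convert light energy into chemical energy by photosynthesis.",
        "Photosynthesis is how animals digest food.",
    ]
    for ans in answers:
        print(f"{answer_score(ideal, ans):.2f}  {ans}")
```

Note that this sketch does no stemming, so "converts" and "convert" count as different words. A teacher would still review borderline scores by hand.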
...and now, a complicated piece only of interest to some specific user segments.
A complicated mash-up scenario, called: "separating the fat-cats from the real corporate performers, in these times of reactionary distress..."
Here is a mash-up scenario. Let's say your company deals with high-level resourcing metrics, or executive compensation consultancy for high-profile employees or directors.
- The use case begins when the system wakes up and works at night, in the background.
- The system finds the company annual report on the web (say, SEC 10-K), gets the names of the directors, and also the key numbers that indicate company performance for the last 3 years.
- The system gets the definitive proxy statements (DEF 14A) for the corresponding period. It gets the resumes of the directors found earlier, and their compensations, stock options, bonuses and so on.
- It goes to LinkedIn and finds their detailed profiles. It establishes their key specialities, education and so forth.
- It works out some metrics, graphs and so on, and feeds these into a web-service payload for consumers of such pre-digested data.
- It does this for 10,000 companies, roughly 120,000 directors or high-level executive staff, in 5 countries around the globe.
- The workflow is configurable, and so are the extracted pieces. So tomorrow the same system could be running, say, a workflow that involves patent submissions by inventors.
- And lastly, the system allows for hits, misses and near misses, and generates an audit log so that a human being can backfill the machine's deficiencies.
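The last two points, a configurable workflow and an audit log of hits and misses, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the field names, the regular expressions, `run_workflow` and the sample text are invented for illustration, and it works on an in-memory string instead of real filings or web pages.

```python
import re

# Each step is (field name, regular expression with one capture group).
# These patterns and the sample text are illustrative assumptions; real
# filings would need far more robust extraction.
WORKFLOW = [
    ("company", r"Company:\s*(.+)"),
    ("director", r"Director:\s*(.+)"),
    ("compensation", r"Total compensation:\s*\$?([\d,]+)"),
]


def run_workflow(doc_id, text, workflow=WORKFLOW):
    """Apply each extraction step; return (record, audit entries)."""
    record, audit = {}, []
    for field, pattern in workflow:
        match = re.search(pattern, text)
        if match:
            record[field] = match.group(1).strip()
            audit.append((doc_id, field, "hit"))
        else:
            record[field] = None  # left empty for a human to backfill
            audit.append((doc_id, field, "miss"))
    return record, audit


if __name__ == "__main__":
    sample = "Company: Example Corp\nDirector: Jane Doe\nBonus: not disclosed\n"
    record, audit = run_workflow("doc-001", sample)
    print(record)
    for entry in audit:
        print(entry)
```

Passing a different list of steps as `workflow` is what "configurable" means here: the same loop could run a patent-submission workflow tomorrow, and the "miss" entries tell a human where to look.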
Sounds plausible?