Showing posts with label information. Show all posts

June 03, 2011

Liberating data using Scraper Wiki

Of all the wiki sites that sprang up after the original, one of the most useful and positively cool is ScraperWiki. ScraperWiki is an attempt to liberate data from websites and PDFs and populate spreadsheets with it instead.

There is a lot of data available on the net. But its value is severely limited by the fact that you cannot do much more than browse it. When you move data from an HTML page or a PDF file into a spreadsheet, the value of the data suddenly goes up manyfold. Now you can analyze the data, sort it, look for trends and coax information out of it. ScraperWiki aids in the first step by scraping web pages and moving the data into usable data sets.

ScraperWiki is two things. First, it is a web-based compiler plus reusable libraries (in Python, Ruby or PHP) that allow you to write and run a scraper. Second, it is a wiki store of scrapers written by others that you can then update, reuse or just run to get data.
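To make the idea of a scraper concrete, here is a minimal sketch of the "HTML table in, spreadsheet rows out" step. It uses only Python's standard library and does not use ScraperWiki's own libraries; the class name `TableScraper`, the function `html_table_to_csv` and the sample page are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Station</th><th>Temp (C)</th></tr>
  <tr><td>Berlin</td><td>21</td></tr>
  <tr><td>Munich</td><td>19</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

A real scraper would first fetch the page over HTTP and cope with messier markup; this sketch only shows the part where table cells become rows you can open in a spreadsheet.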

There are quite a few interesting scrapers. This scraper collects data from weather stations across all of Germany, while this one collects the Location IDs from Weather.com URLs. Weather is not all that scrapers do: this one, for example, collects basic info about all MLB players, while this one is a massive database of all soccer World Cup matches.

Of all the untold millions spent by governments and corporations on digitizing their data and making web pages, a decent portion went towards making HTML tables out of data sets. ScraperWiki is an attempt to reverse that. Cheers to liberating data from the shackles of the web.

March 09, 2011

The Information

The Information: A History, a Theory, a Flood is a book by James Gleick built around an interesting idea: that everything, including matter and energy, is nothing more than information. The author talked to NPR; the interview is embedded below.

While the idea itself is obviously interesting, that is not the only reason I posted this. There is this post that I wrote on this blog, back in 2002. The idea of that post is surprisingly similar to that conveyed in the book.

I imagined a universe where matter is defined as...

"nothing but the extract of the information conveyed to us by the various input devices."

Going further, one could define the entire universe, with all of its constituent parts using ...

"A way in which there is no difference between the various units of matter, energy, ideas, minds and every other thing in the universe. This unified way of looking at the universe is going to help us define the entire universe on a one dimensional scale rating information content."

I am super kicked that my ramblings of 9 years ago were not that empty. In no way is this meant to claim that post was even remotely complete - but there was that germ of a thought.

May 13, 2010

How cell phones work

Everyone has a cell phone now. Dialing a number and having someone answer seems so ordinary that it is considered routine. But a lot needs to happen behind the scenes for that call to get through. After the break, a graphic explaining how this works in more detail.

Removed embedded image after email request from linkremoval@cellphones.org. To the team from cellphones.org: if you really did not want me to link to your image, maybe you should not have included a “use this code to embed into your site” option on the site. Thanks!

October 01, 2002

Rammstein

And more specifically, Stripped. Incredible song. Or for that matter, Kokain. Man, those riffs just drive you out of your mind, don't they?

Filled out a survey today, about some perception thing, of companies recruiting on campus. It was so totally painful. I don't really understand why I spent so much time filling it out. There was this HUGE matrix, which had to be filled with my opinions. Someone did not tell them things properly. I don't have opinions. At least not as many as it takes to fill up that monstrous matrix of theirs. Well, I did try, for a while, as I tried to form opinions on the spot and then put them on paper. Do you know how hard it is to form opinions on the spot? It is. And if you are finding it easy, you don't form opinions, you just think you do. Trust me on this. :)

One of the most incredible things is the fact that most people around you don't bother to form opinions. They have a few opinions of their own. You can tell that an opinion is their own when they can be completely irrational about it and its consequences. But most other opinions you see around are only the sum total of the opinions formed from the positive part of your sphere of perception, that is all.

Okay, enough rambling. Let's continue with the discussion we were having last. In the last post, we talked about a number of definitions that led to the definition of the LSI, or the Linear Scale of Information. Given any observation, it can be located on this scale. What is an observation? An observation is any representation of a Data Source or DS. A photo is an observation of some reality. A word is an observation of some idea. A poem is an observation of some emotion or idea. A simple sentence is also an observation. So is a complex mathematical model of the universe.

One peculiarity of the LSI should be kept in mind. The LSI stretches from 0 to infinity. It is unbounded on the upper side. This means that a DS lies at infinity, and a completely useless bit of information lies at 0. We define data to lie in the lower reaches, closer to 0 on the LSI. Information is relatively higher on the scale. It represents a higher richness of data about a particular DS. Knowledge tends towards the object itself. A picture, worth a thousand words, is therefore higher on the LSI than the words it replaces.

This can be extended to any object, idea, thought or any other information content without any modifications. We can therefore use this structure to compare and develop better and higher forms of information management systems. That is what is envisaged as the end objective of this study. This structure can be used to describe any informational content with ease. We will go into details about the implementation of this structure soon, but before that we shall look into the way this method can be used to model interactions.

We define an interaction to be a process that allows for transfer of data between a DS and a DA (Data Acquirer) using a Data Transfer Medium or DTM, also known simply as the medium. This is the simplest definition of an interaction. An interaction can give rise to one or more of the following results. Information will be transferred from the DS to the DA. In addition, the DS can change its state due to its interaction with the DTM. Further, the interaction between the medium and the DA will cause changes in the DA. Note that these changes are in addition to the simple transfer of information that can be attributed to the interaction.

This in fact follows from the definitions we saw yesterday. We have already talked about a query that is used by the DA to get information from the DS. Now when the query travels from the DA to the medium, the medium has obtained information. This causes a change in the medium itself. When the query is transported to the DS, the DS undergoes changes because of the informational content in the query. Exactly the same process occurs when the DS replies with the answer to the query. The reader may note that no change occurs in the DA during the asking phase of the query, and no change happens in the DS during the reply phase. The DTM undergoes change twice, once with the query and once with the answer.
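The round trip described above can be sketched in a few lines of Python. This is only one possible reading of the framework, not a definitive model: the names `Entity`, `absorb`, `changes` and `interact` are invented for this sketch, and a "change of state" is approximated by counting the pieces of information an entity has received.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Anything that can hold information: a DS, a DA or a DTM."""
    name: str
    received: list = field(default_factory=list)  # information absorbed so far

    def absorb(self, info):
        # Assumption of this sketch: receiving information is what
        # changes an entity's state.
        self.received.append(info)

    @property
    def changes(self):
        return len(self.received)


def interact(da, dtm, ds, query, answer_for):
    """One query/reply round trip between a DA and a DS over a DTM."""
    # Asking phase: the query leaves the DA, so the DA itself does not change.
    dtm.absorb(query)
    ds.absorb(query)
    # Reply phase: the answer leaves the DS, so the DS does not change again.
    answer = answer_for(query)
    dtm.absorb(answer)
    da.absorb(answer)
    return answer


da, dtm, ds = Entity("DA"), Entity("DTM"), Entity("DS")
interact(da, dtm, ds, "what colour is it?", lambda q: "red")
print(da.changes, dtm.changes, ds.changes)  # prints: 1 2 1
```

The medium ends up with two changes while the DA and the DS end up with one each, which is the bookkeeping the paragraph above describes.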

Let's see some practical explanations of the entire structure. Any systemic structure can be abstracted using this. In fact, with the addition of the term interaction, we can now model dynamic changes in systems too.

Mail me if you think there is some structure that cannot be abstracted using this framework. We will go into more practical considerations using this framework in later posts.

This is the first time that I actually continued an idea beyond just one post. That must mean I don't really think this idea is crap.

Regards,

~!nrk

September 29, 2002

Data, Information and Knowledge

This, in short, is the brief history of the universe. I am not writing this blog after being overly fascinated with the book whose title sounds obscenely like the quip I just quipped. In fact, I think I read that book a long, long time ago. I am writing this because I have an alternate view of the universe: a view that breaks everything down into a point on a line whose ends are data and knowledge, with information lying in the middle. I know I am getting a little too abstract here, but I hope things will become clearer as we go along.

When I used to study science, something struck me as very odd. Physics, especially of the variety that is normally taught in high schools, breaks down the universe into two sections: the physical universe and the laws that govern this universe. What struck me as odd is the fact that god actually defined such a cute little dichotomy in his world. Just like we have data and instructions, male and female, good and bad, we have matter and laws. Okay, matter, energy and all that dark crap too, but basically the tangibles and the intangibles. Okay, this is also not very true, but... Wow, this is tough, getting the definition right. But basically the problem with god and his universe boils down to this - how did he come up with something that exists, and then throw away a hell of a lot for us to discover? Why all this segregation? Why this duality? Why was matter there, for all of us to see, and the rest of the relationships, laws etcetera for us to discover?

But then think about it. What was matter? Ask someone in the dark ages (dark ages NOT defined as the time before the computer) and their *ologists will tell you that it is nothing but a combination of air, water, earth, fire and something else. Somewhere down the time line, people will tell you that matter was made of unbreakable balls, called atomz. Then people went berserk. Matter was made of all sorts of strange, mystical and mythical substances, which incidentally no one can see, but which ought to be there for matter to make sense.

So what is different between matter in the dark ages and matter now? Nothing. It is the same old matter burnt and forged into different shapes, but still the same old matter. What changed is information. The information known to man has changed, and this knowledge has changed the way people look at matter. If this information was not available, a lot of people in Nagasaki would have been nth-generation residents, instead of what they are now. Matter has changed because of what matter is to different people. To the ordinary man, matter is nothing more than just earth and air. Hence what is important about matter is not matter itself, but information about matter. What we see as matter is nothing but the extract of the information conveyed to us by the various input devices.

We will now look at a totally different way of seeing the universe. A way in which there is no difference between the various units of matter, energy, ideas, minds and every other thing in the universe. This unified way of looking at the universe is going to help us define the entire universe on a one dimensional scale rating information content. This will then give us a powerful way of dealing with many problems using a vastly simplified, unified methodology.

But before this we need to put a basic framework in place. We postulate the existence of three different types of entities in this universe. The first is the Data Source or DS. The Data Source is characterised by the fact that it contains data. It owes its existence to the data it contains. There is no restriction on the data it contains. Of course, we haven't defined what data itself is. We are deliberately not defining it, since it will be defined by the circumstance under view. Moreover, we cannot define it in isolation from the other units under consideration. The second entity we postulate is the Data Acquirer or DA. The DA can query the DS for data through the third entity, known as the Data Transfer Medium or DTM.

Given these basic units, we define some terms. The first term we will define is the Data Completeness (DC) of an entity. DC is defined as the relative content of data of a particular kind in a particular Data Source. Hence DC is defined for a DS and a Data Type. For example, a DS has 100% DC about itself. Any DS can answer any question about itself. So its DC is complete. Note that DC is independent of the query for data, the way the query is designed, the DA itself and the DTM used for delivery. The actual response of the DS to a query is a function of the ability and capacity of the DTM and the DA.

This leads to an interesting and obvious statement. Any DS is 100% Data Complete with its own data. In fact, any entity which is 100% Data Complete with the data of any DS is virtually indistinguishable from the DS itself. This is because the said entity can answer any question about the DS. This means that no DA can distinguish between the impersonating entity and the Data Source itself.

Now the DC by itself does not give a powerful medium for expressing data relationships. Since the DC is fully defined by the data type and the DS, we define another term called the Relative Data Completeness or RDC. RDC is defined as the relative completeness of data given the DS, the DA and the DTM. For example, a still photograph has an RDC of close to 100% for the original static setup, given that the DA is using sight alone as the source of data input. The moment the DTM expands to include, say, touch, the photograph no longer has 100% RDC.
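The photograph example can be turned into a toy calculation. The post gives no formula for DC or RDC, so everything here is an assumption made for illustration: a DS is modelled as a dictionary from a channel (such as sight or touch) to the set of facts available on that channel, and completeness is the fraction of those facts that an entity reproduces. The function name `completeness` and the sample data are hypothetical.

```python
def completeness(entity, source, channels=None):
    """Fraction of the source's facts that the entity reproduces.

    With channels=None every channel of the source counts (a stand-in for DC).
    With a set of channels, only the channels that the DTM carries and the DA
    can perceive count (a stand-in for RDC).
    """
    if channels is None:
        channels = source.keys()
    total = 0
    matched = 0
    for channel in channels:
        facts = source.get(channel, set())
        total += len(facts)
        matched += len(facts & entity.get(channel, set()))
    # Assumption: with nothing to ask about, the entity is trivially complete.
    return matched / total if total else 1.0


# Hypothetical static scene and a photograph of it.
scene = {
    "sight": {"red apple", "wooden table"},
    "touch": {"smooth skin", "rough grain"},
}
photo = {
    "sight": {"red apple", "wooden table"},
    "touch": {"glossy paper"},
}

print(completeness(scene, scene))                     # 1.0
print(completeness(photo, scene, {"sight"}))          # 1.0
print(completeness(photo, scene, {"sight", "touch"})) # 0.5
```

In this toy version the photograph scores exactly 100% on sight alone, where the post says "close to 100%"; a real photograph loses some visual detail, which the two-fact scene here does not capture.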

The RDC therefore gives a powerful medium to express the quality of data relationships between the DS, the DTM and the DA. We will dwell more on various examples for these terms in later posts.

Data is always handled in packets called observations. This observation is not the observation that is defined for an experiment. An observation is a taggable block of data. Observations differ from one another in their quality. Observations are generally substantiated by data. The amount of data represented by an observation is its relative richness. The richness of an observation is defined on a scale called the Linear Scale of Information or LSI. Data is at one end of the LSI, while knowledge is at the other extreme. Information lies in between. Data are the small individual pieces of information that border on indivisibility. Knowledge is completeness of knowledge. An entity which has 100% Data Completeness (DC) is perfectly knowledgeable, and can in fact replace the DS itself. An entity having an RDC (Relative Data Completeness) of 100% implies that, for a particular DTM and DA, the entity appears to be the DS itself.

We will stop this round of definitions here. Check back for more data and information on these terms soon.

tada for now

~!nrk