header
header Register : : Login header
header
divider
menuleft
menuright
submenu
left

We have a new sponsor!  Introducting Pragma Systems.  See the home page for details.

Querying a CSV file
Last Post 03 Mar 2010 06:38 PM by Paul-B. 5 Replies.
Printer Friendly
Sort:
PrevPrev NextNext
You are not authorized to post a reply.
Author Messages
Paul-BUser is Offline
New Member
New Member
Posts:19
Avatar

--
02 Mar 2010 09:05 PM  
Evening,

I'm looking for some advice on the best / most efficient way of querying information in a CSV file. The file has around 30,000 rows and 7 columns, each row is already unique based on the first column value. I can use Import-CSV to bring the data in but not sure how to iterate through the data based on second CSV (200 rows, single column).

So to expand, the first large file contains distinct domain records containing information on the connection status, i.e. success, failure, connections, messages, etc. The second smaller CSV is a list of domains. What I'm attempting to do is get the detailed information for the small list of domains based on the detailed infomration in the the all domains list.

So far I've looked at iterating the smaller CSV and using the input as a constraint for where-object whilst moving over the larger, detailed CSV. But this seems very inefficient as I'm repeating this 200 times, re-reading all 30,000 objects). I've also looked at placing the 30,000 objects into an associative array where the domain is the key and the value is an array of the other details. However, building the associative array takes forever.

Any suggestions?

MT

Paul
glnsizeUser is Offline
Basic Member
Basic Member
Posts:186

--
03 Mar 2010 01:11 AM  

You could try group-object I've used it in the past to collate large sets of data.  Never 30,000 that's just gonna take a long time to process!

$domains = Import-Csv ./some.csv            
            
Import-CVS ./ReallyBig.csv |             
    Group-Object Name |             
    Where-Object {$domains -like $_.Name} |             
    Select-Object -ExpandProperty Group

You may want to look into just ingesting the data into a simple database.

~Glenn

Paul-BUser is Offline
New Member
New Member
Posts:19
Avatar

--
03 Mar 2010 07:15 AM  
I did consider whether I should use dotNet and datasets but was looking for other peoples opinion on the best way to do this task
GWHowarth88User is Offline
Basic Member
Basic Member
Posts:336
Avatar

--
03 Mar 2010 10:14 AM  
If you're optimizing for performance, the easiest way would be something like this (if my assumptions are correct):

Use a StreamReader to read each line in the smaller CSV and populate an ArrayList with the domains
Load the larger CSV with Import-CSV
For each object in the larger CSV, compare the object's key with each element in the array of domains. If a match is found, select that object, remove the domain just found from the ArrayList to decrease the size of the array for the next comparison, then break from the current iteration to prevent the object's key being compared with the remaining domains in the array.

Translated to code, that would look something like this:

$domains = New-Object System.Collections.ArrayList
$reader = New-Object System.IO.StreamReader -ArgumentList "myDomainsCSV.csv"

while ($reader.Peek() -ge 0) # While not end of file
{
$domains.Add($reader.Readline())
}

$reader.Close()

$domains.RemoveRange(0, 1) # Remove the first element because it is the column header

$matchingRecords = New-Object System.Collections.ArrayList
$records = Import-CSV -Path "myBigCSV.csv"

foreach ($record in $records)
{
foreach ($domain in $domains)
{
if ($record.Key -eq $domain)
{
$matchingRecords.Add($record)
$domains.Remove($record) # Remove the domain that has just been found from the ArrayList, as it is no longer needed and optimizes the next iteration
break # Ignore remaining domains (if any) and process next record
}
}
}

That would remove a lot of unneccesary iterations from the process.
Cruisader03User is Offline
Basic Member
Basic Member
Posts:243
Avatar

--
03 Mar 2010 02:40 PM  
Filter it as the data is being imported. As soon as it hits a record that is already in $domains it continues to the next line of the large file. No extra breaks or reads to be done. No re-iterations after the data is imported either. Also, notice that $domains is a simple array of strings for the domains csv file. This will help with speed and keep the script less complex.

$domains = Get-Content Domains.csv
$records = Import-Csv AllRecords.csv | Where {$domains -contains $_.Domain}
$records
When at first you don't succeed Step-Into

http://theposherlife.blogspot.com
http://www.jandctravels.com

Paul-BUser is Offline
New Member
New Member
Posts:19
Avatar

--
03 Mar 2010 06:38 PM  
Wow simple and fast. I'd not thought of using 'where' like that

Thank you very much

Paul
You are not authorized to post a reply.

Active Forums 4.1
right
   
footer Sponsored by Quest Software • SAPIEN Technologies • Compellent • Microsoft Windows Server 2008 footer
footer